Remove blank lines, spaces, and paragraph marks from a text file

Posted 2024-09-28 20:51:13


I have a text file with sample data like this:

 E-RECEIPT FOR  TRANSFER FUNDS                                                                                                                                                                                                                                                                                                                                                                                                         

   Payee Name:                                                   AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Nickname:                                                     AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Credit Account No::                                           AAAA0000006666                                                                                                                                                                                                                                                                                                                                                         

   Remarks:                                                      4869                                                                                                                                                                                                                                                                                                                                                                    

   Debit Account:                                                99999999999999                                                                                                                                                                                                                                                                                                                                                         

   Date:                                                         05 May '20                                                                                                                                                                                                                                                                                                                                                              

   Amount:                                                       INR 4,869.00         (Rupees     Four Thousand Eight Hundred Sixty  Nine  and Zero Paisa only) 

If I open this file in Word (File --> Options --> Display --> Always show these formatting marks on the screen, with all the options below it selected), I see:

 ....E-RECEIPT FOR  TRANSFER Of Funds...................................................................Payee Name...................
.....................................................................................................
AAA CHS.........................................................AAA CHS...........................Nickname ....etc 

Here (...) means spaces. Between the lines Word also shows paragraph symbols (¶, the pilcrow), and at the end of the file it shows 3 paragraph symbols.

I would like output like this (with the spaces and paragraph marks removed):

E-RECEIPT FOR  TRANSFER FUNDS
Payee Name:                                                   AAA CHS 
Nickname:                                                     AAA CHS 
Credit Account No::                                           AAAA0000006666
...
...

I tried the following:

file=open("c:\\temp1\\tt1.txt", "r+")
for line in file.readlines():
    print(line.strip())
file.close()

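It does not work. Note that I do not want to remove the spaces between words; I want to remove the blank space and special characters between lines.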

Second, although it is not a requirement, can I have just one space before and after ":" or "::"?

E-RECEIPT FOR  TRANSFER FUNDS
Payee Name : AAA CHS 
Nickname : AAA CHS 
Credit Account No :: AAAA0000006666

...and so on


1 answer
User
#1 · Posted 2024-09-28 20:51:13

Use this convenience function:

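import re
def text_processor(s):
    s = s.replace('::', ':')        # collapse the doubled colon
    s = re.sub(r'\n\n', '|\n', s)   # mark each blank-line break with '|' (assumes no '|' in the text)
    s = re.sub(r'\s{2,}', ' ', s)   # squeeze whitespace runs, newlines included, to one space
    return '\n'.join(s.split('|')).replace(':', ' :')  # one record per line, space before ':'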

Example

# s = your text
# assuming you are reading in from a file: 'data.txt'
# with open('data.txt', 'r') as f:
#    s = f.read()
print(text_processor(s))

Output

E-RECEIPT FOR TRANSFER FUNDS 
 Payee Name : AAA CHS 
 Nickname : AAA CHS 
 Credit Account No : AAAA0000006666 
 Remarks : 4869 
 Debit Account : 99999999999999 
 Date : 05 May '20 
 Amount : INR 4,869.00 (Rupees Four Thousand Eight Hundred Sixty Nine and Zero Paisa only) 
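
The one-liner collapses every run of whitespace, so it also removes the spacing between words that the question wanted to keep, and it leaves a stray space at the start and end of most lines. As a rough alternative, the sketch below works line by line: it drops blank lines, trims each remaining line, and only optionally collapses inner spacing and puts one space around a run of colons. The name `clean_lines` and the `collapse` flag are made up for this illustration and are not part of the answer above.

```python
import re

def clean_lines(text, collapse=True):
    """Drop blank lines and trim each remaining line.

    `clean_lines` and `collapse` are illustrative names only.
    """
    cleaned = []
    for line in text.splitlines():
        line = line.strip()          # remove leading/trailing whitespace
        if not line:                 # skip lines that were only whitespace
            continue
        if collapse:
            # exactly one space before and after a run of colons (':' or '::')
            line = re.sub(r'\s*(:+)\s*', r' \1 ', line)
            # squeeze remaining runs of whitespace inside the line
            line = re.sub(r'\s{2,}', ' ', line).strip()
        cleaned.append(line)
    return '\n'.join(cleaned)
```

With `collapse=False` the column spacing inside each line is kept, which matches the first output requested in the question. Note that the colon rule would also touch values such as a time (`10:30`), so it assumes colons only appear as label separators.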

Dummy data

s = """
E-RECEIPT FOR  TRANSFER FUNDS                                                                                                                                                                                                                                                                                                                                                                                                         

   Payee Name:                                                   AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Nickname:                                                     AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Credit Account No::                                           AAAA0000006666                                                                                                                                                                                                                                                                                                                                                         

   Remarks:                                                      4869                                                                                                                                                                                                                                                                                                                                                                    

   Debit Account:                                                99999999999999                                                                                                                                                                                                                                                                                                                                                         

   Date:                                                         05 May '20                                                                                                                                                                                                                                                                                                                                                              

   Amount:                                                       INR 4,869.00         (Rupees     Four Thousand Eight Hundred Sixty  Nine  and Zero Paisa only) 
"""

print(s)

Opening a Docx file from Python

Reference: source

import docx2txt

# read in word file
s = docx2txt.process("data.docx")

# Copy pasting the dummy data into a docx file
# and trying to read and correcting the data 
# requires the following fix

print(text_processor(s).replace(' \n \n ', '\n'))
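
The `.replace(' \n \n ', '\n')` fix only matches that exact sequence of spaces and newlines. A slightly more tolerant sketch, assuming the goal is simply one record per line, collapses any whitespace run that contains a newline into a single newline. The helper name `squeeze_line_breaks` is hypothetical, and the sample string merely stands in for the processed docx text.

```python
import re

def squeeze_line_breaks(text):
    """Replace each whitespace run containing a newline with one newline."""
    # \s*\n\s* swallows spaces, tabs and extra newlines around a line break,
    # but leaves spaces that sit between words on the same line untouched.
    return re.sub(r'\s*\n\s*', '\n', text).strip()

# Stand-in for the string returned by text_processor on the docx text
messy = "Payee Name : AAA CHS \n \n Nickname : AAA CHS \n \n "
print(squeeze_line_breaks(messy))
```

Because the pattern does not depend on how many spaces or newlines separate two records, it should keep working if the docx export pads the lines differently.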
