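I've just started learning Python. For work I go through a lot of PDFs, so I found the PDFMiner tool, which can convert a directory of them into text files. I then wrote the code below to tell me whether a given PDF is an approved claim or a denied claim. What I can't figure out is how to say: find the string that starts with "Tracking Identification Number...", take the 18 characters that follow it, and put them into an array.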
import os
import glob
import csv

def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)

iterate()
Any help would be greatly appreciated. This is what the text file looks like:
Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT. WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------
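Update: Lots of helpful answers. This is the route I took, and it works rather well, if I do say so myself. This will save so much time! Here is my full code for future viewers.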
import os
import glob
import csv  # needed for the CSV output at the bottom

arrayDenied = []
arrayApproved = []  # must exist before check() appends to it

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        check(infile)

def check(filename):
    # Read the file once and reuse the text in every branch.
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Denied claim: keep the 18 characters after the label.
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:
        # Approved claim: keep the tracking number and the 11-character claim number.
        print("This claim was Approved")
        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]
        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]
        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")

iterate()

# Write one value per row to each CSV file.
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])

with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])

print(arrayDenied)
print(arrayApproved)
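Update: Added the rest of my finished code, which writes the lists to CSV files. From there I run a few =LEFT()-style commands, and within minutes I had 1,000 tracking numbers. This is why programming is great.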
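If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, move to the end of it, and slice from that point through the end of the following 18 characters.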
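The index-then-slice idea can be sketched roughly as follows. This is only an illustration: the names `label`, `line`, and `tracking_number` are my own, and `line` is copied from the sample text in the question.

```python
# Minimal sketch of "find the label, then slice the next 18 characters".
label = "Tracking Identification Number..."
line = ("Shipper Invoice Number..............30057010"
        "Tracking Identification Number...1Z000000YW00000000")

# str.index returns where the label starts; adding its length
# moves the position to the first character after the label.
start = line.index(label) + len(label)
tracking_number = line[start:start + 18]

print(tracking_number)  # 1Z000000YW00000000
```

Note that `str.index` raises `ValueError` when the label is missing, so a file without a tracking number would need a `try`/`except` or a prior `in` check.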
You can also change the append line to
arrayDenied.append(myText + ' ' + myNumber)
or something similar.
I think this solves your problem; just turn it into a function.
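One way the "turn it into a function" suggestion might look is sketched below. The helper name `extract_after` is hypothetical, and the claim number in `sample` is made up purely for illustration; only the tracking-number line comes from the question.

```python
def extract_after(text, label, length):
    """Return the `length` characters that follow `label`, or None if absent."""
    pos = text.find(label)  # str.find returns -1 instead of raising
    if pos == -1:
        return None
    start = pos + len(label)
    return text[start:start + length]

# Hypothetical sample: the claim number below is invented for this sketch.
sample = (
    "Tracking Identification Number...1Z000000YW00000000\n"
    "Claim Number 12345678901\n"
)

print(extract_after(sample, "Tracking Identification Number...", 18))
print(extract_after(sample, "Claim Number ", 11))
```

With a helper like this, the tracking-number and claim-number lookups share one code path instead of repeating the index arithmetic in each branch.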
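If you want to read the documentation, it is here: https://docs.python.org/3/library/re.html#re.search
Regular expressions are the way to go for this task. Here is one way to modify your code to search for a pattern.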
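Explanation:
r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"
(?<=Tracking Identification Number)
Lookbehind: requires the string "Tracking Identification Number" immediately before the rest of the match, without making it part of the match.
(?:(\.+))
Matches one or more dots (.); we strip these dots out afterwards.
[A-Z-a-z0-9]{18}
Matches 18 upper- or lowercase letters or digits. As written, the class also admits a literal hyphen, because of the "-" between "A-Z" and "a-z".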
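The pattern explained above could be applied roughly like this. This is a sketch rather than the answerer's original code: `TRACKING_RE`, `find_tracking_number`, and `sample` are names I made up, and the sample line is taken from the question's text file.

```python
import re

# Pattern quoted from the explanation above.
TRACKING_RE = re.compile(
    r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"
)

def find_tracking_number(text):
    """Return the 18-character tracking number, or None if there is no match."""
    match = TRACKING_RE.search(text)
    if match is None:
        return None
    # The matched text still starts with the run of dots, so strip them.
    return match.group(0).lstrip(".")

sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000")

print(find_tracking_number(sample))  # 1Z000000YW00000000
```

Unlike the fixed slice, this does not depend on exactly three dots following the label, and `re.search` returns `None` rather than raising when the label is absent.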