Cleaning up a text file after parsing a PDF


I have parsed a PDF file and cleaned it up as best I can, but I still can't get the information in the text file to line up. My output looks like this:

<pre><code>Zone 1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.
</code></pre>

I would like it to look like this: ^{pr2}$

I tried enumerating the list, but some fields are missing from some records, so a fixed offset picks up the wrong information.

Here is the code that parses the PDF:

<pre><code>import os
import sys
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)
    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return

# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file (binary mode, since a PDF is not text)
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()
</code></pre>

This is what I have so far:

<pre><code>import os

# Fragments that mark header/footer lines to be dropped
OddsnEnds = [
    "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:",
    "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]

if not os.path.exists("../out/"):
    os.makedirs("../out/")

with open("../txt/20140731.txt", 'r') as file:
    blotterList = file.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any([o in line for o in OddsnEnds]):
            cleanList.append(line)
    # Drop blank lines
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print ('Incident:%s' % cleanList[i])
    # Assumes the value always sits on the line after its label
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print ('Time:%s' % cleanList[i+1])
</code></pre>

But the output I get from the enumeration is:

<pre><code>Time:16:20
Time:17:40
Time:17:53
Time:18:05
Time:Location of Occurrence
</code></pre>

because that incident has no time. Also note that all strings end with \n.

Any and all ideas and help are much appreciated.
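One way to approach the missing-field problem described above is to stop relying on `cleanList[i+1]` and instead treat every known label as a boundary: lines after a label belong to it until the next label appears, so a label followed directly by another label simply ends up with no value. The following is a minimal sketch under stated assumptions, not a definitive solution. The function name `group_fields`, the `LABELS` list (copied from the sample output in the question), the choice of "Report Name" as the start of each record, and the decision to skip "Zone" lines are all assumptions made for illustration. The second record in `sample` is invented to show a record with missing values.

```python
# Hypothetical label set, taken from the sample output in the question.
LABELS = [
    "Report Name", "Incident Time", "Location of Occurrence",
    "Neighborhood", "Incident", "Age", "Gender", "Section", "Description",
]


def group_fields(lines, labels=LABELS, record_start="Report Name",
                 skip_prefixes=("Zone ",)):
    """Group label/value lines into one dict per record.

    A line equal to a known label opens a field; the lines that follow are
    collected as that field's values until the next label shows up. A label
    followed directly by another label keeps an empty list.
    """
    label_set = set(labels)
    records = []
    current = None   # dict for the record being built
    field = None     # label whose values are currently being collected
    for raw in lines:
        line = raw.strip()  # every string in the question ends with "\n"
        if not line or line.startswith(skip_prefixes):
            continue
        if line in label_set:
            # Assumption: each record begins with its "Report Name" label.
            if line == record_start or current is None:
                current = {}
                records.append(current)
            field = line
            current.setdefault(field, [])
        elif current is not None:
            current[field].append(line)
    return records


# Made-up sample: the second record has no time and no location.
sample = [
    "Zone 1\n", "Report Name\n", "ARREST\n", "Incident Time\n", "01:41\n",
    "Location of Occurrence\n", "1300 block Liverpool St\n",
    "Report Name\n", "ARREST\n", "Incident Time\n",
    "Location of Occurrence\n", "Neighborhood\n", "Highland Park\n",
]

for record in group_fields(sample):
    times = record.get("Incident Time", [])
    print("Time: %s" % (times[0] if times else "(missing)"))
```

Because values are stored as lists, multi-line fields such as Section and Description keep all of their entries. Note that this sketch would need the "Report Name" entry removed from `OddsnEnds`, since the filter in the question discards those lines before they can serve as record boundaries.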

