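<p>I have parsed a PDF file and cleaned it up as best I can, but I still cannot get the information to line up in the text file.</p>
<p>My output looks like this:</p>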
<pre><code>Zone
1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.
</code></pre>
<p>I would like it to look like this:</p>
^{pr2}$
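<p>I tried enumerating the list, but some fields are missing for some incidents, so the lookup picks up the wrong information.</p>
<p>Here is the code that parses the PDF:</p>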
<pre><code>import os
import sys
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)
    # write to the given file, or to stdout when no output file is given
    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos,
                                  maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return

# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()
# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")
# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()
# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")
# Get pdf from blotter site, save it in a file
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
# a PDF is binary data, so open the file in binary mode
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()
# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")
# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()
</code></pre>
<p>Here is what I have so far:</p>
<pre><code>import os

# fragments of header/footer lines that should be dropped
OddsnEnds = [ "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:", "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]

if not os.path.exists("../out/"):
    os.makedirs("../out/")

with open("../txt/20140731.txt", 'r') as infile:
    blotterList = infile.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any([o in line for o in OddsnEnds]):
            cleanList.append(line)

    # drop blank lines
    while '\n' in cleanList:
        cleanList.remove('\n')

    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print ('Incident:%s' % cleanList[i])
    # the value is assumed to sit on the line right after its label
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print ('Time:%s' % cleanList[i+1])
</code></pre>
<p>But the output I get from enumerating is:</p>
<pre><code>Time:16:20
Time:17:40
Time:17:53
Time:18:05
Time:Location of Occurrence
</code></pre>
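<p>because that incident has no time, so the line after the <code>Incident Time</code> label is the next label rather than a value. Note also that every string ends with \n.</p>
<p>Any and all ideas and help are greatly appreciated.</p>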
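<p>To make the failure concrete, here is a minimal sketch of one way the lookup could avoid grabbing the next label: collect the lines that follow each label until another label shows up, so a label with nothing after it simply gets an empty list. The <code>LABELS</code> set and the <code>group_by_label</code> helper are hypothetical names introduced for illustration, the labels are only the ones visible in the sample output above, and the sketch handles the lines of a single incident (splitting the file into incidents is not shown).</p>

```python
# Labels copied from the sample output in the question. This set is an
# assumption; the real blotter may use additional labels.
LABELS = {
    "Zone", "Report Name", "Incident Time", "Location of Occurrence",
    "Neighborhood", "Incident", "Age", "Gender", "Section", "Description",
}


def group_by_label(lines):
    """Map each label to the list of value lines that follow it.

    A label that is immediately followed by another label gets an empty
    list instead of borrowing the next label as its value.
    """
    record = {}
    current = None
    for raw in lines:
        text = raw.strip()  # every line in the question ends with "\n"
        if not text:
            continue  # skip blank lines
        if text in LABELS:
            current = text
            record.setdefault(current, [])
        elif current is not None:
            record[current].append(text)
    return record


sample = [
    "Report Name\n", "ARREST\n",
    "Incident Time\n",  # no time recorded for this incident
    "Location of Occurrence\n", "1300 block Liverpool St\n",
    "Section\n", "3921(a)\n", "3925\n", "903\n",
]
record = group_by_label(sample)
print(record["Incident Time"])  # []
print(record["Section"])        # ['3921(a)', '3925', '903']
```

<p>Keeping a list per label also covers fields such as <code>Section</code> and <code>Description</code>, which span several lines in the sample output.</p>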