Cleaning up a text file after parsing a PDF

Zone 1 Report Name ARREST Incident Time 01:41 Location of Occurrence 1300 block Liverpool St Neighborhood Highland Park Incident 14081898 Age 27 Gender M Section 3921(a) 3925 903 Description Theft by Unlawful Taking or Disposition - Movable item Receiving Stolen Property. Criminal Conspiracy.

# Python 2 script (uses urllib2 and the file() builtin)
import os
import sys
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)
    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return

# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file (binary mode, since a PDF is not text)
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()

import os

OddsnEnds = [
    "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:",
    "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]

if not os.path.exists("../out/"):
    os.makedirs("../out/")

with open("../txt/20140731.txt", 'r') as file:
    blotterList = file.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any([o in line for o in OddsnEnds]):
            cleanList.append(line)
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print ('Incident:%s' % cleanList[i])
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print ('Time:%s' % cleanList[i+1])

2 Answers

User

#1 · Edited 2024-05-08 14:56:36

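In general, extracting text from PDF files (especially when you want to keep the formatting, spacing, and layout of the text) is a task that cannot always be done with 100% accuracy. I learned this from a technical support person at a company that makes a popular library for extracting text from PDFs (xpdf), while I was working on a project in this area. At the time I looked into several libraries for extracting text from PDFs, including xpdf and a few others. There are definite technical reasons why they cannot always give perfect results (although in many cases they do); these reasons have to do with the nature of the PDF format and with how a given PDF was generated. When extracting text from some PDFs, layout and spacing may not be preserved, even when you use the library's option for it (such as keep_format=True or an equivalent).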

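The only permanent solution to this problem is to avoid extracting text from PDF files at all. Instead, always try to work with the data format and data source from which the PDF was generated, and do your text extraction and manipulation on that. Of course, this is easier said than done if you cannot get access to those sources.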

User

#2 · Edited 2024-05-08 14:56:36

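One of my favorite approaches is to get the text with pdftotext (from the poppler utilities) and its -layout option. It does a good job of preserving the original layout of the document.

You can call it from Python with the subprocess module.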
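A minimal sketch of that approach, assuming pdftotext is installed and on the PATH. It is written for Python 3, unlike the Python 2 script in the question, and the function names build_pdftotext_command and pdf_to_text_layout are made up for illustration:

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for: pdftotext -layout <pdf> <txt>."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text_layout(pdf_path, txt_path):
    """Convert pdf_path to txt_path with pdftotext and return the text."""
    # Fail early with a readable message if the tool is missing.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install the poppler utilities")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    # Assumes UTF-8 output; undecodable bytes are replaced rather than raising.
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

With the folder layout from the question, a call might look like pdf_to_text_layout("../pdf/20140731.pdf", "../txt/20140731.txt"). Passing the command as a list rather than a shell string avoids quoting problems with unusual file names.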
