I am trying to extract text from a PDF so that I can analyze it, but when I try to extract the text from a page I get the following error.
Traceback (most recent call last):
File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_comm.py", line 765, in doIt
result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_vars.py", line 376, in evaluateExpression
result = eval(compiled, updated_globals, frame.f_locals)
File "<string>", line 1, in <module>
File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1701, in extractText
content = ContentStream(content, self.pdf)
File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1783, in __init__
stream = StringIO(stream.getData())
File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\generic.py", line 801, in getData
decoded._data = filters.decodeStreamData(self)
File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 228, in decodeStreamData
data = ASCII85Decode.decode(data)
File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in decode
data = [y for y in data if not (y in ' \n\r\t')]
File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in <listcomp>
data = [y for y in data if not (y in ' \n\r\t')]
TypeError: 'in <string>' requires string as left operand, not int
The relevant part of the code is below:
[code block not preserved]
The error is raised on the extractText() line.
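You are doing two things on one line. Try splitting that line up to narrow down the problem. Change:
[code block not preserved]
into
[code block not preserved]
and see where it goes wrong. Also, run the program from the command line rather than from Eclipse, to make sure you get the same error. You say it happens at extractText(), but that line does not show up in the traceback.

It may be worth trying the latest version of PyPDF2, which is 1.24 at the time of writing.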
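The TypeError at the bottom of the traceback is consistent with a Python 2 versus Python 3 difference, which may be why a newer PyPDF2 release is worth trying. The sketch below is only an illustration, assuming the stream data reaches the filter as bytes; the names `data`, `strip_whitespace_py2_style`, `error_message`, and `cleaned` are made up for this example and are not part of PyPDF2.

```python
# Assumption: under Python 3 the raw stream data is a bytes object.
data = b"ab \n"


def strip_whitespace_py2_style(data):
    # Same shape as the list comprehension shown in the traceback.
    # Iterating over bytes in Python 3 yields ints, so "y in <str>" fails.
    return [y for y in data if not (y in ' \n\r\t')]


try:
    strip_whitespace_py2_style(data)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Testing membership against a bytes literal works for int items.
cleaned = bytes(y for y in data if y not in b' \n\r\t')

print(error_message)
print(cleaned)
```

If this is indeed the cause, the failure would sit in the library's ASCII85 filter under Python 3 rather than in the calling code.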
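That said, I have found extractText() to be quite fragile. It works for some documents and fails for others. See some of the open issues:
https://github.com/mstamy2/PyPDF2/issues/180 and https://github.com/mstamy2/PyPDF2/issues/168
I worked around this by using the Poppler command-line utility pdftotext, both to classify documents as image versus text and to get all the content. It has been very stable for me: I have run it on thousands of PDF documents. In my experience it also extracts text from protected/encrypted PDFs without further ado.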
For example (written for Python 2):
[code block not preserved]
HTH
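A minimal sketch of the pdftotext approach, written for Python 3 and not a reproduction of the answer's own Python 2 code. It assumes Poppler's `pdftotext` is installed and on the PATH. The helper names (`build_pdftotext_command`, `extract_text`, `looks_like_image_only`) and the `min_chars` threshold are hypothetical choices for illustration.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    # Passing "-" as the output file asks pdftotext to write to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    # Fail early with a clear message if Poppler is not installed.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_like_image_only(text, min_chars=20):
    # Hypothetical heuristic: a scanned PDF usually yields almost no text,
    # only whitespace and the form feeds pdftotext emits between pages.
    return len("".join(text.split())) < min_chars
```

Calling `extract_text("some.pdf")` and passing the result to `looks_like_image_only` would give a rough image-versus-text classification; the threshold would likely need tuning for a real document set.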