The extractText() function in PyPDF2 raises an error

Published 2024-05-19 02:09:17

I am trying to extract text from a PDF so that I can analyze it, but when I try to extract the text from a page I get the following error.

Traceback (most recent call last):
File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_comm.py", line 765, in doIt
    result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)

File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_vars.py", line 376, in evaluateExpression
    result = eval(compiled, updated_globals, frame.f_locals)

File "<string>", line 1, in <module>

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\generic.py", line 801, in getData
    decoded._data = filters.decodeStreamData(self)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 228, in decodeStreamData
    data = ASCII85Decode.decode(data)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in decode
    data = [y for y in data if not (y in ' \n\r\t')]

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in <listcomp>
    data = [y for y in data if not (y in ' \n\r\t')]

TypeError: 'in <string>' requires string as left operand, not int

The relevant part of the code is as follows:

[code snippet not preserved in the source]

The error is raised on the extractText() line.


2 Answers

You are doing two things on one line. Try splitting the statement up to narrow down the problem. Change:

page_Content = Pdf_File.getPage(pg_idx).extractText()

to:

[code snippet not preserved in the source]

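and see where it goes wrong. Also, run the program from the command line rather than from Eclipse to make sure you get the same error. You say it happens in extractText(), but that line of your code does not appear in the traceback.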
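One way to make the split explicit is a small helper that reports which of the two steps failed. This is only a sketch: `extract_page_text` is a hypothetical name, and `reader` stands for whatever object is called `Pdf_File` in the question. It relies only on the `getPage()` and `extractText()` methods already used there.

```python
def extract_page_text(reader, page_index):
    """Fetch a page and extract its text as two separately reported steps.

    `reader` is assumed to expose getPage(), and the returned page is
    assumed to expose extractText(), as in the question's code.
    """
    # Step 1: fetch the page object on its own line.
    try:
        page = reader.getPage(page_index)
    except Exception as exc:
        raise RuntimeError("getPage(%d) failed: %r" % (page_index, exc))

    # Step 2: extract the text, so a failure here is clearly attributed.
    try:
        return page.extractText()
    except Exception as exc:
        raise RuntimeError(
            "extractText() failed on page %d: %r" % (page_index, exc)
        )
```

Calling `extract_page_text(Pdf_File, pg_idx)` in place of the one-liner then tells you from the message alone which call raised.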

It may also be worth trying the latest version of PyPDF2, which is 1.24 at the time of writing.
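For what it is worth, the last frames of the traceback are consistent with a bytes object being iterated under Python 3: iterating over bytes yields integers, and testing an integer for membership in a str raises exactly this TypeError. The sketch below reproduces that outside PyPDF2. The function names are hypothetical, and this is neither PyPDF2's actual code nor its fix, just an illustration of the mechanism.

```python
WHITESPACE_STR = " \n\r\t"
WHITESPACE_BYTES = b" \n\r\t"


def strip_whitespace_str_style(data):
    # Same shape as the comprehension shown in the traceback.
    return [y for y in data if not (y in WHITESPACE_STR)]


def strip_whitespace_bytes_style(data):
    # Under Python 3 each y is an int, and "int in bytes" is allowed.
    return bytes(y for y in data if y not in WHITESPACE_BYTES)


# With str input the original comprehension works as intended.
str_result = strip_whitespace_str_style("ab c\n")

# With bytes input (Python 3) it fails, because each y is an int.
try:
    strip_whitespace_str_style(b"ab c\n")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

bytes_result = strip_whitespace_bytes_style(b"ab c\n")
```

If that is what happens here, the problem lies in how the library handles the stream data on Python 3 rather than in your own code, which is another reason to try a newer release.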

That said, I have found extractText() to be very fragile. It works on some documents and fails on others. See some of the open issues:

https://github.com/mstamy2/PyPDF2/issues/180
https://github.com/mstamy2/PyPDF2/issues/168

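I worked around the problem by using the Poppler command-line utility pdftotext, both to classify documents as image versus text and to get all of the content. It has been very stable for me; I have run it on thousands of PDF documents. In my experience it also extracts text from protected/encrypted PDFs without further ado.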

For example (written for Python 2):

from __future__ import print_function  # lets Python 2 accept print(..., file=...)

import subprocess
import sys


def consult_pdftotext(filename):
    '''
    Runs pdftotext to extract text of pages 1..3.
    Returns the count of characters received.

    `filename`: Name of PDF file to be analyzed.
    '''
    print("Running pdftotext on file %s" % filename, file=sys.stderr)
    # Don't forget the final hyphen: it tells pdftotext to write to stdout.
    cmd_args = ["pdftotext", "-f", "1", "-l", "3", filename, "-"]
    pdf_pipe = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_out, std_err = pdf_pipe.communicate()
    count = len(std_out)
    return count
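The image-versus-text classification mentioned above can be built on the same call. The following is a rough sketch for Python 3, not the exact code I run: the function names and the `min_chars` threshold of 50 are assumptions chosen for illustration, and it assumes `pdftotext` is on the PATH.

```python
import subprocess


def extract_text(filename, first_page=1, last_page=3):
    """Return the text pdftotext extracts from the given page range.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if pdftotext is not installed.
    """
    cmd_args = [
        "pdftotext",
        "-f", str(first_page),
        "-l", str(last_page),
        filename,
        "-",  # the final hyphen sends the output to stdout
    ]
    std_out = subprocess.check_output(cmd_args)
    return std_out.decode("utf-8", errors="replace")


def classify_text(text, min_chars=50):
    """Label a document 'text' or 'image' from its extracted text.

    min_chars is an arbitrary, assumed threshold; tune it for your corpus.
    """
    # Drop all whitespace, including the form feeds pdftotext emits
    # between pages, before counting.
    visible = "".join(text.split())
    return "text" if len(visible) >= min_chars else "image"
```

Usage would be something like `classify_text(extract_text("scan.pdf"))`, where a scanned document with no text layer yields little or nothing and is labelled 'image'.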

HTH
