The extractText() function in PyPDF2 raises an error

Traceback (most recent call last):
  File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_comm.py", line 765, in doIt
    result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
  File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_vars.py", line 376, in evaluateExpression
    result = eval(compiled, updated_globals, frame.f_locals)
  File "<string>", line 1, in <module>
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\generic.py", line 801, in getData
    decoded._data = filters.decodeStreamData(self)
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 228, in decodeStreamData
    data = ASCII85Decode.decode(data)
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in decode
    data = [y for y in data if not (y in ' \n\r\t')]
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in <listcomp>
    data = [y for y in data if not (y in ' \n\r\t')]
TypeError: 'in <string>' requires string as left operand, not int

2 Answers

User

Answer 1 · edited 2024-05-19 02:09:17

You are doing two things on one line. Try splitting them apart to narrow down the problem. Change:

page_Content = Pdf_File.getPage(pg_idx).extractText()

to

page = Pdf_File.getPage(pg_idx)
page_Content = page.extractText()

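and see which step fails. Also, run the program from the command line rather than from Eclipse to make sure you get the same error. You say it happens in extractText(), but that line does not appear in the traceback.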
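As a side note, the last frame of the traceback can be reproduced without PyPDF2. The following is only a minimal sketch, assuming the stream data arrives as a bytes object under Python 3; the sample data and both helper names are made up for illustration, and this is not a patch for PyPDF2 itself.

```python
# Assumption: under Python 3 the stream data is a bytes object.
# The sample below is hypothetical and not taken from a real PDF.
data = b"87cURD ]i,\n"


def strip_whitespace_str(data):
    # Same expression as filters.py line 170 in the traceback.
    # Iterating over bytes yields ints, so "int in str" raises TypeError.
    return [y for y in data if not (y in ' \n\r\t')]


def strip_whitespace_bytes(data):
    # Hypothetical variant: test membership against a bytes literal,
    # which accepts the ints produced by iterating over bytes.
    return bytes(y for y in data if y not in b' \n\r\t')


try:
    strip_whitespace_str(data)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
print(strip_whitespace_bytes(data))
```

If the first call fails on your machine with the same message as in the question, the problem is a str/bytes mismatch inside the ASCII85 filter rather than anything in your own line.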

User

Answer 2 · edited 2024-05-19 02:09:17

It may be worth trying the latest version of PyPDF2, which is 1.24 as I write this.

That said, I have found the extractText() function to be very fragile. It works for some documents and fails for others. See some of the open issues:

https://github.com/mstamy2/PyPDF2/issues/180 and https://github.com/mstamy2/PyPDF2/issues/168

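I worked around this by using the Poppler command-line utility pdftotext, both to classify documents as image versus text and to get all of the content. It has been very stable for me: I have run it on thousands of PDF documents. In my experience it also extracts text from protected/encrypted PDFs without further ado.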

For example (written for Python 2):

from __future__ import print_function  # allows print(..., file=...) on Python 2

import subprocess
import sys


def consult_pdftotext(filename):
    '''
    Runs pdftotext to extract text of pages 1..3.
    Returns the count of characters received.

    `filename`: Name of PDF file to be analyzed.
    '''
    print("Running pdftotext on file %s" % filename, file=sys.stderr)
    # Don't forget the final hyphen: it tells pdftotext to write to stdout.
    cmd_args = ["pdftotext", "-f", "1", "-l", "3", filename, "-"]
    pdf_pipe = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_out, std_err = pdf_pipe.communicate()
    # On Python 2 this is the length of a str; on Python 3 it would count bytes.
    count = len(std_out)
    return count
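
To show how the character count might feed the image-versus-text classification mentioned above, here is a rough sketch for Python 3.5+. The threshold value and both function names are assumptions made for illustration, not part of my original script; tune the threshold on your own documents.

```python
import subprocess

# Hypothetical cutoff: fewer extracted characters than this on pages 1..3
# is treated as a sign that the PDF is a scanned image.
MIN_TEXT_CHARS = 100


def classify_by_char_count(char_count, min_chars=MIN_TEXT_CHARS):
    # Pure decision step, kept separate so it can be checked without pdftotext.
    return "text" if char_count >= min_chars else "image"


def classify_pdf(filename):
    # Same pdftotext invocation as above: pages 1..3, output to stdout ("-").
    cmd_args = ["pdftotext", "-f", "1", "-l", "3", filename, "-"]
    result = subprocess.run(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    text = result.stdout.decode("utf-8", errors="replace")
    return classify_by_char_count(len(text.strip()))
```

Calling classify_pdf("some.pdf") requires the pdftotext binary to be on the PATH; the file name here is only a placeholder.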

HTH
