PyPDF2 ignores the content and extracts only the watermark

Posted 2024-09-23 10:24:17


I have thousands of PDF files, like this one.

I am trying to convert them to plain text with PyPDF2 (code below), but PyPDF2 apparently only "sees" the watermark and not the content itself. What can I do here?

import os
import PyPDF2

path_to_pdfs = '/path/to/pdf/files/'
for filename in os.listdir(path_to_pdfs):
    # Match on the file extension only, not on '.pdf' anywhere in the name
    if filename.lower().endswith('.pdf'):
        with open(os.path.join(path_to_pdfs, filename), mode='rb') as f:
            txt = ''
            # PyPDF2 1.26.0 API; later releases renamed these calls
            pdf_reader = PyPDF2.PdfFileReader(f)
            num_pages = pdf_reader.numPages
            for page in range(num_pages):
                page_obj = pdf_reader.getPage(page)
                page_text = page_obj.extractText()
                txt = txt + '\n' + page_text
            print(txt)

I am using Python 3.5.1 and PyPDF2 1.26.0 on macOS 10.13.14.


1 answer
User
#1 · Posted 2024-09-23 10:24:17

pdfminer3k sometimes gives better results. See "How to read pdf file using pdfminer3k?"

I tested the code below and it does extract the text. The extraction is not 100% accurate, however.

# pdfminer3k imports (in this package PDFDocument lives in pdfminer.pdfparser)
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

extracted_text = ''

# Open the example file; the with block closes it once extraction is done
with open('Decisao_10166720039201098.pdf', 'rb') as fp:
    # Link the parser and the document object to each other
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')  # empty password

    # Set up layout analysis
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    laparams.char_margin = 1.0
    laparams.word_margin = 1.0
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        # Keep only the layout objects that carry text
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()

print(extracted_text)
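
Since the question is about thousands of files, the single-file code above would probably need to be wrapped in a loop over the directory. Below is a minimal sketch of one way to do that. The names `extract_all` and `extract_text` are hypothetical and are not part of pdfminer3k or PyPDF2. `extract_text` stands for any function that takes an open binary file object and returns a string, for example the pdfminer3k steps above packaged into a function. Files that raise an error are recorded as `None` so one bad PDF does not stop the whole run.

```python
import os


def extract_all(pdf_dir, extract_text):
    """Apply extract_text to every .pdf file in pdf_dir.

    extract_text is a hypothetical callable: it receives a file object
    opened in binary mode and returns the extracted text as a string.
    Returns a dict mapping each filename to its text, or to None when
    extraction raised an exception.
    """
    results = {}
    for filename in sorted(os.listdir(pdf_dir)):
        # Skip anything that does not end in .pdf (case-insensitive)
        if not filename.lower().endswith('.pdf'):
            continue
        path = os.path.join(pdf_dir, filename)
        try:
            with open(path, 'rb') as fp:
                results[filename] = extract_text(fp)
        except Exception as exc:
            # Report the failure and carry on with the remaining files
            print('Skipping {}: {}'.format(filename, exc))
            results[filename] = None
    return results
```

It could then be called as `texts = extract_all('/path/to/pdf/files/', my_extractor)`, where `my_extractor` is whatever extraction function works best for the files at hand.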
