Python library pdfplumber does not extract lines

Anti-Money Laundering and Counter-Terrorism Financing Act 2006 No. 169, 2006 Compilation No. 48 Compilation date: 20 December 2018 Includes amendments up to: Act No. 156, 2018 Registered: 7 January 2019 Prepared by the Office of Parliamentary Counsel, Canberra Authorised Version C2019C00011 registered 07/01/2019

[{'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('2.010'), 'upright': 1, 'x0': Decimal('120.500'), 'y0': Decimal('797.823'), 'x1': Decimal('122.510'), 'y1': Decimal('805.863'), 'width': Decimal('2.010'), 'height': Decimal('8.040'), 'size': Decimal('8.040'), 'object_type': 'char', 'page_number': 1, 'stroking_color': 0, 'non_stroking_color': 0, 'text': ' ', 'top': Decimal('36.057'), 'bottom': Decimal('44.097'), 'doctop': Decimal('36.057')}, {'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('6.138'), 'upright': 1, 'x0': Decimal('120.500'), 'y0': Decimal('170.315'), 'x1': Decimal('126.638'), 'y1': Decimal('181.355'), 'width': Decimal('6.138'), 'height': Decimal('11.040'), 'size': Decimal('11.040'), 'object_type': 'char', 'page_number': 1, 'stroking_color': 0, 'non_stroking_color': 0, 'text': 'P', 'top': Decimal('660.565'), 'bottom': Decimal('671.605'), 'doctop': Decimal('660.565')}, {'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('3.676'), 'upright': 1, 'x0': Decimal('126.638'), 'y0': Decimal('170.315'), 'x1': Decimal('130.315'), 'y1': Decimal('181.355'), 'width': Decimal('3.676'), 'height': Decimal('11.040'), 'size': Decimal('11.040'), 'object_type': 'char', 'page_number': 1, 'stroking_color': 0, 'non_stroking_color': 0, 'text': 'r', 'top': Decimal('660.565'), 'bottom': Decimal('671.605'), 'doctop': Decimal('660.565')}, {'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('4.902'), 'upright': 1, 'x0': Decimal('130.315'), 'y0': Decimal('170.315'), 'x1': Decimal('135.216'), 'y1': Decimal('181.355'), 'width': Decimal('4.902'), 'height': Decimal

3 answers

User

#1 · edited 2024-06-21 20:12:59

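If you want to detect lines of text, the best approach is probably to loop over every character of the PDF object and check for changes in the character metadata. pdfplumber provides a lot of metadata for each character; the property most useful here is probably top, which the documentation describes as the distance of the top of the character from the top of the page.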

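With this, you can add a line-break check by testing whether a distance (for example, the distance from the top of the page) differs between character 1 and character 2. You can add each character and its top value to a list like this: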

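import pdfplumber

# The path is the placeholder from the original post; it must be passed as a string.
your_pdf = pdfplumber.open(r"your\path\here")
pg28 = your_pdf.pages[27]  # pages are zero-indexed, so index 27 is page 28
your_page = pg28.extract_text()
char_list = []
for each_char in pg28.chars:
    # Store each character together with its distance from the top of the page
    char_list.append([each_char["text"], each_char["top"]])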

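You can then compare each top value with the next one in the list, as described in this answer. This tells you which characters start a new line and which sit on the same line.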
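The comparison of neighbouring top values can be sketched roughly as follows. This is only a minimal illustration, not the method from the linked answer: the helper name group_chars_into_lines, the tolerance of 2 points, and the sample characters are all assumptions made for the example. It expects dictionaries with the "text", "top" and "x0" keys that pdfplumber puts on each entry of page.chars.

```python
def group_chars_into_lines(chars, tolerance=2):
    """Group char dicts into text lines by comparing their "top" values.

    `chars` is a list of dicts with "text", "top" and "x0" keys.
    `tolerance` (an assumed default) is how far a character's top may
    differ from the first character of a line and still count as being
    on that line.
    """
    lines = []
    # Walk the characters from the top of the page downwards.
    for ch in sorted(chars, key=lambda c: c["top"]):
        if lines and abs(ch["top"] - lines[-1]["top"]) <= tolerance:
            # Close enough to the current line's first character: same line.
            lines[-1]["chars"].append(ch)
        else:
            # The top value jumped: start a new line.
            lines.append({"top": ch["top"], "chars": [ch]})
    # Within each line, order characters left to right and join their text.
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
        for line in lines
    ]


# Hypothetical sample data standing in for page.chars.
sample_chars = [
    {"text": "y", "top": 50.0, "x0": 10.0},
    {"text": "H", "top": 36.0, "x0": 10.0},
    {"text": "o", "top": 50.5, "x0": 16.0},
    {"text": "i", "top": 36.4, "x0": 16.0},
]
print(group_chars_into_lines(sample_chars))  # prints ['Hi', 'yo']
```

With a real page you would pass pg28.chars instead of the sample list. The tolerance probably needs tuning per document, since superscripts or mixed font sizes can shift top slightly within one visual line.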

User

#2 · edited 2024-06-21 20:12:59

The answer is in the documentation you posted:

.lines, each representing a single 1-dimensional line.

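This refers to geometric lines (vector elements), not lines of text. PDF has no concept of a text line (or of any higher-order grouping of characters).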
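To make the distinction concrete, here is a small sketch that reads both things from a page: the drawn line objects in page.lines, and the text, split on the newlines that page.extract_text() inserts. The helper name describe_page and the stand-in page object are assumptions for illustration; with pdfplumber you would pass a real page such as pdf.pages[0].

```python
from types import SimpleNamespace


def describe_page(page):
    """Return (number of geometric lines, list of text lines) for a page.

    `page` is expected to offer a `.lines` list and an `.extract_text()`
    method, as pdfplumber pages do.
    """
    geometric_count = len(page.lines)  # drawn vector lines, not text
    text = page.extract_text() or ""   # guard against an empty result
    text_lines = [ln for ln in text.splitlines() if ln.strip()]
    return geometric_count, text_lines


# Hypothetical stand-in for a page with text but no drawn lines.
fake_page = SimpleNamespace(
    lines=[],
    extract_text=lambda: "Part 1\nIntroduction\n",
)
print(describe_page(fake_page))  # prints (0, ['Part 1', 'Introduction'])
```

A page of plain prose will typically report zero geometric lines while still yielding many text lines, which is why .lines can come back empty for a document like the one in the question.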

User

#3 · edited 2024-06-21 20:12:59

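If you want to extract lines of text, you need to use PDFMiner (which pdfplumber is built on). It groups the characters on each page into text lines, and groups text lines into text boxes, based on horizontal/vertical alignment: https://pdfminer-docs.readthedocs.io/programming.html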

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextLineHorizontal, LTTextBoxHorizontal


# Keep the file open while pages are processed; close it automatically afterwards.
with open('document.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    # Create resource manager.
    rsrcmgr = PDFResourceManager()
    # Set parameters for layout analysis.
    laparams = LAParams()
    laparams.detect_vertical = True
    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for the page.
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextLineHorizontal):
                # A text line sitting directly on the page.
                print(element.get_text())
            elif isinstance(element, LTTextBoxHorizontal):
                # A text box: iterate over the text lines it contains.
                for line in element:
                    print(line.get_text())
