"'NoneType' object is not iterable" when trying to extract text from a PDF

import requests
import pdfplumber
import pandas as pd
import re
from collections import namedtuple

Line = namedtuple('Line', 'gbloc_name contact_type email')

gbloc_re = re.compile(r'^(?:a\.\s[A-Z]{5}\:\s[A-Z]{4})')
line_re = re.compile(r'^[^@\s]+@[^@\s]\.[^@\s]+$')

file = 'sampleReport.pdf'
lines = []

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text:
            gbloc = gbloc_re.search(line)
            if gbloc:
                gbloc_name = gbloc
            elif line.startswith('Outbound'):
                contact_type = 'Outbound'
            elif line.startswith('Tracing'):
                contact_type = 'Tracing'
            elif line.startswith('Customer'):
                contact_type = 'Customer Service'
            elif line.startswith('QA'):
                contact_type = 'Quality Assurance'
            elif line.startswith('NTS'):
                contact_type = 'NTS'
            elif line.startswith('Inbound'):
                contact_type = 'Inbound'
            elif line_re.search(line):
                items = line.split()
                lines.append(Line(gbloc_name, contact_type, *items))

2 answers

User

#1 · edited 2024-10-04 01:30:18

I use the PyPDF2 library to extract text from PDFs. Here is a simple example that extracts the content page by page:

import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    # PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
    reader = PyPDF2.PdfReader(pdf_file)
    print(len(reader.pages))
    for i, page in enumerate(reader.pages):
        print("Page: ", i)
        # extract_text() replaces the older extractText()
        print(page.extract_text())

If you have any questions, reply here and let me know.

User

#2 · edited 2024-10-04 01:30:18

Try looping directly over the value returned by page.extract_text(), like this:

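with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page without a text layer;
        # splitlines() yields whole lines instead of single characters
        for line in (page.extract_text() or '').splitlines():
            print(line)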
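
To make the likely cause concrete: the error in the question most plausibly comes from `page.extract_text()` returning `None` for a page with no extractable text, which the `for line in text` loop then tries to iterate. The sketch below is one possible way to guard against that and to walk the text line by line rather than character by character. It is not the asker's exact code. The helper names `parse_page_text` and `parse_pdf`, the capture group in `gbloc_re`, the relaxed `email_re` pattern, and the assumption that an email may share a line with its contact-type label are all illustrative guesses about the report layout.

```python
import re
from collections import namedtuple

Line = namedtuple('Line', 'gbloc_name contact_type email')

# Assumption: the 4-letter code after "a. XXXXX: " is the value worth keeping.
gbloc_re = re.compile(r'^a\.\s[A-Z]{5}:\s([A-Z]{4})')
# Assumption: a loose email pattern; the question's version allows only a
# single character between "@" and the first dot.
email_re = re.compile(r'[^@\s]+@[^@\s]+\.[^@\s]+')

CONTACT_TYPES = {
    'Outbound': 'Outbound',
    'Tracing': 'Tracing',
    'Customer': 'Customer Service',
    'QA': 'Quality Assurance',
    'NTS': 'NTS',
    'Inbound': 'Inbound',
}


def parse_page_text(text, gbloc_name=None, contact_type=None):
    """Parse one page of extracted text (hypothetical helper).

    Returns the rows found plus the last seen gbloc_name and contact_type,
    so that state can be carried over to the next page.
    """
    rows = []
    # `text or ''` turns a None page into an empty string, so the loop
    # simply does nothing instead of raising a TypeError.
    for raw in (text or '').splitlines():
        line = raw.strip()
        match = gbloc_re.search(line)
        if match:
            gbloc_name = match.group(1)
            continue
        for prefix, label in CONTACT_TYPES.items():
            if line.startswith(prefix):
                contact_type = label
                break
        email = email_re.search(line)
        if email:
            rows.append(Line(gbloc_name, contact_type, email.group(0)))
    return rows, gbloc_name, contact_type


def parse_pdf(path):
    """Run parse_page_text over every page of a PDF (needs pdfplumber)."""
    import pdfplumber  # imported here so the parser above works without it

    lines, gbloc_name, contact_type = [], None, None
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            rows, gbloc_name, contact_type = parse_page_text(
                page.extract_text(), gbloc_name, contact_type
            )
            lines.extend(rows)
    return lines
```

Calling `parse_pdf('sampleReport.pdf')` would then return a list of `Line` tuples that can be passed to `pd.DataFrame(...)`. Keeping the parsing in a function that takes plain text also makes it possible to check the logic on a short sample string before running it on the real report.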
