Converting text to a table in Python

Posted 2024-09-30 01:25:41


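I'm new to Python and pandas and am trying to convert a PDF document into a pandas DataFrame.

I was able to extract the PDF to text, but I'm not sure how to save the required values from the text below into a dictionary.

So far I have used the following code to extract the PDF as text: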

import pandas as pd
import pdfplumber

data = './data/coc.pdf'

with pdfplumber.open(data) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()

print(text)

Here is the extracted text:

" \n \nCertificate of Currency \nXYZ Limited \nABN 121011100 54720   AFSL 232111 \n \nAs at Date \n2 November 2015 \nPolicy Information \nPolicy Type \nProfessional  \n \n \nInsured \nUniversity of ABC and others as defined by the policy \ndocument. \n \nInsurer \nMMO Limited   \n                                                    \nPolicy Number(s) \L0K107721013  \n \nPeriod of Insurance \nFrom 4.00pm 1/11/2015 to 4.00pm 1/11/2016 \n \nInterest Insured \nLoss incurred as a result of a civil liability claim made against the insured \nbased solely on the insured’s provision of their professional services  \n \nLimit of Liability \n$20,000,000 any one claim and $60,000,000 in the aggregate for all claims "

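Now I want to convert the text into a dictionary. How can I update the code below, using an if condition, to create a dictionary of key:value pairs based on \n or \n\n?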

d = dict([i.splitlines() for i in text.split('\n\n')])

I get the following error:

Error: dictionary update sequence element:

[image: screenshot of the error message]

Please see the attached table for the desired output.

Desired output:

[image: table of the desired output]


2 answers

Updated answer

I made some changes to make this easier to do and to show how to actually find the values.

from pprint import pprint
from datetime import datetime


def clean_text(text) -> list:
    """Removes empty lines as well as leading and trailing spaces.
    Also removes EOL characters.

    Args:
        text (str/list): Input text

    Returns:
        (list): A list of strings
    """
    if isinstance(text, str):
        splittext = text.splitlines()
    elif isinstance(text, list):
        splittext = text
    else:
        # fail early instead of hitting an unbound variable below
        raise TypeError("text must be a str or a list of str")
    result = []
    for line in splittext:
        cleaned = line.strip()
        if cleaned != "":
            result.append(cleaned)
    return result


filename = 'text_sample1.txt'
with open(filename) as infile:
    text = infile.read()



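uncleaned_sections = text.split('\n \n')
# clean each section, then drop sections that end up empty
# (e.g. when the text starts with blank lines)
sections = [clean_text(section.splitlines()) for section in uncleaned_sections]
sections = [section for section in sections if section]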

for secindex, section in enumerate(sections):
    for lineindex, line in enumerate(section):
        print(f'sections[{secindex}][{lineindex}]: {line}')


# With the above, we have sections of data instead of one block of data.
# This means we can change the way we deal with it.

# We should have 7 sections, provided that all input files are structured
# in the same way.
assert len(sections) == 7, f"expected 7 sections, got {len(sections)}"


pd_dt = datetime.strptime(sections[1][1], '%d %B %Y')
policy_date = f'{pd_dt.day:02}/{pd_dt.month:02}/{pd_dt.year}'
abn = sections[0][2].split('ABN ')[1].split('AFSL')[0].strip()
policy_number = sections[3][3]
period_start = sections[4][1].split("From ")[1].split(" to ")[0].split(' ')[1]
period_end = sections[4][1].split(" to ")[1].split(' ')[1]
insured = ' '.join(sections[2][1:])
insurer = sections[3][1]
interest_insured = ' '.join(sections[5][1:])

as_dict = {
    'Date': policy_date,
    'ABN': abn,
    'Policy Number': policy_number,
    'Period Start': period_start,
    'Period End': period_end,
    'Insured': insured,
    'Insurer': insurer,
    'Interest Insured': interest_insured
}

pprint(as_dict)

Output

sections[0][0]: Certificate of Currency
sections[0][1]: XYZ Limited
sections[0][2]: ABN 121 011100 54720   AFSL 81141141
sections[1][0]: As at Date
sections[1][1]: 2 November 2015
sections[1][2]: Policy Information
sections[1][3]: Policy Type
sections[1][4]: Professional
sections[2][0]: Insured
sections[2][1]: University of ABC and others as defined by the policy
sections[2][2]: document.
sections[3][0]: Insurer
sections[3][1]: MMO Limited
sections[3][2]: Policy Number(s)
sections[3][3]: L0K107721013
sections[4][0]: Period of Insurance
sections[4][1]: From 4.00pm 1/11/2015 to 4.00pm 1/11/2016
sections[5][0]: Interest Insured
sections[5][1]: Loss incurred as a result of a civil liability claim made against the insured
sections[5][2]: based solely on the insured’s provision of their professional services
sections[6][0]: Limit of Liability
sections[6][1]: $20,000,000 any one claim and $60,000,000 in the aggregate for all claims
sections[6][2]: during the period of insurance. (Subject to the reinstatement provisions of
sections[6][3]: the policy).
sections[6][4]: ABN 121 011100 54720
{'ABN': '121 011100 54720',
 'Date': '02/11/2015',
 'Insured': 'University of ABC and others as defined by the policy document.',
 'Insurer': 'MMO Limited',
 'Interest Insured': 'Loss incurred as a result of a civil liability claim '
                     'made against the insured based solely on the insured’s '
                     'provision of their professional services',
 'Period End': '1/11/2016',
 'Period Start': '1/11/2015',
 'Policy Number': 'L0K107721013'}
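
If the hard-coded indexes feel too brittle, one alternative is to treat a known set of heading lines as keys and join the lines that follow each heading into its value. This is only a sketch: the `HEADINGS` set and the `lines_to_dict` helper are my own assumptions, and the heading names must match the PDF text exactly.

```python
# Hypothetical set of heading lines; adjust it to the headings in your PDF.
HEADINGS = {
    "As at Date", "Policy Type", "Insured", "Insurer",
    "Policy Number(s)", "Period of Insurance",
    "Interest Insured", "Limit of Liability",
}


def lines_to_dict(text: str) -> dict:
    """Map each known heading to the joined lines that follow it."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line in HEADINGS:
            current = line  # start collecting values for this heading
            result[current] = []
        elif current is not None:
            result[current].append(line)
    return {key: " ".join(value) for key, value in result.items()}


sample = (
    " \nInsured \nUniversity of ABC and others as defined by the policy \n"
    "document. \n \nInsurer \nMMO Limited   \n \nPolicy Number(s) \n"
    "L0K107721013  \n"
)
as_dict = lines_to_dict(sample)
print(as_dict)
```

Lines that are not in `HEADINGS` (for example "Policy Information") are appended to the previous key, so the set needs to be complete for your document.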

Note: previous answer

This answer applies to an earlier iteration of the question :-/


You need to process the lines in the text and find the right lines to extract the data you need.
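
For context, the `dict(...)` attempt in the question most likely fails because `dict()` expects every element of the sequence to be a (key, value) pair, while `splitlines()` returns however many lines a chunk happens to contain. A minimal sketch with made-up chunks:

```python
# Made-up chunks for illustration: the first has two lines, the second has three.
chunks = ["Insurer\nMMO Limited", "Insured\nUniversity of ABC\ndocument."]

pairs = [chunk.splitlines() for chunk in chunks]

try:
    d = dict(pairs)
    error = None
except ValueError as exc:
    # dict() raises ValueError when an element is not exactly two items long.
    error = str(exc)

print(error)
```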

I have provided some examples of how to find the values listed in the question. I suggest you look into the basic string methods I used:

string.split()
string.splitlines()
string.strip()
f-strings (I really like f-strings)
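
As a quick illustration of these methods, here is a minimal sketch applied to a fragment of the certificate text. The variable names (`sample`, `abn_line`, `parts`) are only for illustration.

```python
# Illustrative fragment of the extracted text.
sample = "XYZ Limited \nABN 121011100 54720   AFSL 232111 \n \nAs at Date \n2 November 2015 "

# splitlines() breaks the text at line boundaries and drops the "\n" characters.
lines = sample.splitlines()

# strip() removes leading/trailing whitespace; skip lines that end up empty.
cleaned = [line.strip() for line in lines if line.strip()]

# split() with no argument splits on any run of whitespace.
abn_line = cleaned[1]
parts = abn_line.split()

# f-strings interpolate values directly into a string.
abn = f"{parts[1]} {parts[2]}"
print(cleaned)
print(f"ABN: {abn}")
```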

The code below should be enough to get you started.

Code

def clean_text(text: str) -> list:
    """Removes empty lines as well as leading and trailing spaces.
    Also removes EOL characters.

    Args:
        text (str): Input text

    Returns:
        (list): A list of strings
    """
    # always return a list, even for a single-line string,
    # so that callers can safely use list methods on the result
    splittext = text.splitlines()
    result = []
    for line in splittext:
        cleaned = line.strip()
        if cleaned != "":
            result.append(cleaned)
    return result


text = " \n \nCertificate of Currency \nXYZ Limited \nABN 121011100 54720   AFSL 232111 \n \nAs at Date \n2 November 2015 \nPolicy Information \nPolicy Type \nProfessional  \n \n \nInsured \nUniversity of ABC and others as defined by the policy \ndocument. \n \nInsurer \nMMO Limited   \n                                                    \nPolicy Number(s) \nL0K107721013  \n \nPeriod of Insurance \nFrom 4.00pm 1/11/2015 to 4.00pm 1/11/2016 \n \nInterest Insured \nLoss incurred as a result of a civil liability claim made against the insured \nbased solely on the insured’s provision of their professional services  \n \nLimit of Liability \n$20,000,000 any one claim and $60,000,000 in the aggregate for all claims "
text = clean_text(text)
# I always add this when using the 'previous_line' method below;
# it can reduce failures
text.append('')


previous_line = ""
for line in text:
    # the policy number is on the line right after its heading
    if "Policy Number(s)" in previous_line:
        policy_number = line
    elif "From " in line and "Period of Insurance" in previous_line:
        # this is a secondary check for the start/end dates
        # Just in case another line in the text contains 'From '
        start = line.split("From ")[1].split(" to ")[0]
        end = line.split(" to ")[1]
    elif "ABN " in line and "AFSL " in line:
        # each substring needs its own `in` test
        # using a different method than the splits above
        abn_split = line.split()
        abn_value = f"{abn_split[1]} {abn_split[2]}"

    previous_line = line

# Some try/except blocks to check if the values have been found
try:
    print(f"Start: {start}\nEnd: {end}")
except NameError as e:
    print("Start/End dates not found")

try:
    print(f"ABN: {abn_value}")
except NameError as e:
    print("ABN not found")

try:
    print(f"Policy Number: {policy_number}")
except NameError as e:
    print("Policy number not found")

Output

Start: 4.00pm 1/11/2015
End: 4.00pm 1/11/2016
ABN: 121011100 54720
Policy Number: L0K107721013  
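
Instead of relying on `NameError`, another option is to start every field at `None` and fill in whatever is found. A minimal sketch under that assumption; the `extract_fields` function and its key names are hypothetical:

```python
def extract_fields(lines: list) -> dict:
    """Scan cleaned lines and return the fields found; missing ones stay None."""
    found = {"policy_number": None, "start": None, "end": None, "abn": None}
    previous_line = ""
    for line in lines:
        if "Policy Number(s)" in previous_line:
            found["policy_number"] = line
        elif line.startswith("From ") and " to " in line:
            # drop the "From " prefix, then split once on " to "
            start, end = line[len("From "):].split(" to ", 1)
            found["start"], found["end"] = start, end
        elif "ABN " in line and "AFSL " in line:
            parts = line.split()
            found["abn"] = f"{parts[1]} {parts[2]}"
        previous_line = line
    return found


lines = [
    "ABN 121011100 54720   AFSL 232111",
    "Policy Number(s)",
    "L0K107721013",
    "Period of Insurance",
    "From 4.00pm 1/11/2015 to 4.00pm 1/11/2016",
]
for key, value in extract_fields(lines).items():
    print(f"{key}: {value if value is not None else 'not found'}")
```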

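Your question is indeed a bit broad, but if I understand correctly, your main problem is converting a table in a PDF into a pandas DataFrame, and the question you posted came up only after you had read the PDF as text. However, I think it is easier to just use tabula and let it parse the PDF directly into a DataFrame.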

Install tabula:

pip install tabula-py

Use tabula:

import tabula
dfs = tabula.read_pdf("myfile.pdf", pages='all')
# Note that dfs is a list of DataFrames: the tables found in the PDF.

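That's it. Your other questions are about pandas; I suggest searching the pandas documentation and forums for answers, or asking a separate question about how to work with DataFrames.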

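Note that PDF processing is never 100% accurate, because PDFs are not inherently meant to be interpreted by machines. This solution may or may not work, depending on the PDF. For me, it did solve a similar problem.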
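
Because extraction can go wrong silently, it may be worth sanity-checking the values before loading them into a DataFrame. A minimal sketch, assuming the day/month/year date format seen in the sample certificate; the `validate_period` helper is hypothetical:

```python
from datetime import datetime


def validate_period(start: str, end: str, fmt: str = "%d/%m/%Y") -> bool:
    """Return True if both dates parse and the start is before the end."""
    try:
        start_dt = datetime.strptime(start, fmt)
        end_dt = datetime.strptime(end, fmt)
    except ValueError:
        # strptime raises ValueError when the text does not match the format.
        return False
    return start_dt < end_dt


print(validate_period("1/11/2015", "1/11/2016"))
print(validate_period("1/11/2015", "garbled"))
```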
