How to load a .txt file into a Python DataFrame

Posted 2024-10-01 07:15:59


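I am trying to load a txt file that contains data like the sample below. The file has about 1 million records. Each record consists of several fields (which should become columns), and I have manually added commas as separators. The challenge is that not all records have the same set of fields. The columns should be "TIME", "ENTER", "TRANSID", "SUPERCODE", "ID", "MRP", "VOLUME", "VALUE", "PRODUCTYPE", "BUILDING", "TAXNUMBER", "TAGFIELDS".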

00:00:00.000:, ENTER, transId=1, Supercode=BD3G, id=1, MRP=0.12s9, volume=110333, value=20942463.27, productype=se IA CF, building=11430, taxnumber=110F1, tagFields={B=C C=NZd3/1 D="20170514 07:41:53.616" F=:00000017PouM H=LMT O=6521B841:00023662-A-15.1sd01.200.0.50dsd03.0.0 R="Order not Added" a=A c=FIRST3eNZA j=N}

00:00:00.000:, ENTER, transId=2, Supercode=BYG, id=2, MRP=0.195, volume=223000, value=43485, productype=se IA CF, building=110, taxnumber=110I1, tagFields={B=C C=NZ3 D="20170514 07:41:25.161" F=:00000017PouK H=LMT O=6521B841:00023625-A-15.101.200.0.5003.0.0 R="Ordernot Added" a=A c=FIRSTNZA j=N}

# For this record there is no taxnumber, so the TAXNUMBER column should be blank/NaN for this record

00:00:00.000:, ENTER, transId=3, Supercode=TBC, id=3, MRP=2.71, volume=3750, value=10162.5, productype=It CF UeCP, building=110, tagFields={B=C C=4331K D="20170514 13:59:51.288" H=LMT K=12345O=6521B841:0027d59B6-B-15.101.200.0.5009.0.0 R="Order notAdded" a=P c=4sd33E j=N}

# For this record there is no building, so the BUILDING column should be blank/NaN for this record

00:00:00.000:, ENTER, transId=4, Supercode=ABT, id=4, MRP=2.73, volume=357, value=974.61, productype=se IrA CtF, taxnumber=110B1, tagFields={B=C C=ZBJF D="20170929 16:10:01.321" H=LT O=6521B5841:003A98565-A-15.101.2050.0.5009.0.0 R="Order not Added" a=A c=BNPLLCOLO j=Y}

I tried the following:

import pandas as pd

data = pd.read_csv("path.txt",delimiter=",",header=None)

and got this output:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 66017, saw 11
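
The error is consistent with records carrying different numbers of comma-separated fields: `read_csv` infers the expected column count from the start of the file and fails when a later row is longer. A minimal sketch for checking this before parsing, using shortened stand-ins for the records above (the helper name `field_count_histogram` and the trimmed `tagFields` values are made up for illustration):

```python
from collections import Counter


def field_count_histogram(lines):
    """Count how many lines have each number of comma-separated fields."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comment lines
        counts[len(line.split(","))] += 1
    return counts


# Shortened stand-ins for the question's records (tagFields trimmed for brevity).
sample = [
    "00:00:00.000:, ENTER, transId=1, Supercode=BD3G, id=1, MRP=0.12s9, "
    "volume=110333, value=20942463.27, productype=se IA CF, building=11430, "
    "taxnumber=110F1, tagFields={B=C}",
    "",
    "# no taxnumber in the next record",
    "00:00:00.000:, ENTER, transId=3, Supercode=TBC, id=3, MRP=2.71, "
    "volume=3750, value=10162.5, productype=It CF UeCP, building=110, "
    "tagFields={B=C}",
    "00:00:00.000:, ENTER, transId=4, Supercode=ABT, id=4, MRP=2.73, "
    "volume=357, value=974.61, productype=se IrA CtF, taxnumber=110B1, "
    "tagFields={B=C}",
]

histogram = field_count_histogram(sample)
print(dict(histogram))
```

For the real file, the same helper can be fed an open file object, for example `with open("path.txt") as f: print(field_count_histogram(f))`. More than one distinct count in the result means a plain comma-delimited read cannot line the columns up.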


2 Answers

Here is a small script that converts the data file into a CSV file:

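import csv

columns = "TIME ENTER TRANSID SUPERCODE ID MRP VOLUME VALUE PRODUCTYPE BUILDING TAXNUMBER TAGFIELDS".split()

# newline='' lets the csv module control line endings itself
with open("path.txt") as source, open("path.csv", "w", newline='') as sink:
    # restval='' leaves a column empty when a record lacks that field
    writer = csv.DictWriter(sink, fieldnames=columns, restval='')
    writer.writeheader()

    for line in source:
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # skip blank lines and comment lines
        time, enter, *tail = line.split(',')
        # every remaining item looks like "key=value"; split on the first '=' only
        key_value_pairs = (item.strip().split('=', maxsplit=1) for item in tail)
        d = {'TIME': time.strip(), 'ENTER': enter.strip()}
        d.update((key.upper(), value) for key, value in key_value_pairs)

        writer.writerow(d)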

You can then load the data with:

import pandas

df = pandas.read_csv("path.csv")
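
As a variation on the same idea, the records can be parsed straight into dictionaries without an intermediate CSV file. The sketch below is an assumption-laden illustration, not part of the original answer: `parse_record` and `TAG_MARKER` are hypothetical names, the `tagFields` values are shortened, and it assumes `tagFields` is always the last field, which lets it be split off first so that any comma inside the braces cannot break the parse.

```python
import re

COLUMNS = ("TIME ENTER TRANSID SUPERCODE ID MRP VOLUME VALUE "
           "PRODUCTYPE BUILDING TAXNUMBER TAGFIELDS").split()

# Hypothetical helper pattern: matches the separator in front of tagFields.
TAG_MARKER = re.compile(r",\s*tagFields=", re.IGNORECASE)


def parse_record(line):
    """Turn one raw record line into a dict keyed by upper-cased field name."""
    # Assumption: tagFields, when present, is the last field of the record.
    parts = TAG_MARKER.split(line.strip(), maxsplit=1)
    record = {}
    if len(parts) == 2:
        record["TAGFIELDS"] = parts[1]
    time, enter, *pairs = [item.strip() for item in parts[0].split(",")]
    record["TIME"] = time
    record["ENTER"] = enter
    for item in pairs:
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments that are not key=value
            record[key.strip().upper()] = value.strip()
    return record


# Shortened versions of records 3 and 4 from the question.
lines = [
    "00:00:00.000:, ENTER, transId=3, Supercode=TBC, id=3, MRP=2.71, "
    "volume=3750, value=10162.5, productype=It CF UeCP, building=110, "
    "tagFields={B=C C=4331K a=P j=N}",
    "",
    "00:00:00.000:, ENTER, transId=4, Supercode=ABT, id=4, MRP=2.73, "
    "volume=357, value=974.61, productype=se IrA CtF, taxnumber=110B1, "
    "tagFields={B=C C=ZBJF a=A j=Y}",
]

# Skip blank lines and comment lines before parsing.
records = [parse_record(l) for l in lines
           if l.strip() and not l.lstrip().startswith("#")]

missing = [c for c in COLUMNS if c not in records[0]]
print(missing)  # ['TAXNUMBER']
```

Passing the result to `pandas.DataFrame(records, columns=COLUMNS)` should give one row per record, with NaN wherever a record lacks a field.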

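Try passing engine='python' and error_bad_lines=False to pd.read_csv(). (error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is its replacement.)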
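
Note that skipping malformed lines drops records instead of aligning them. A small sketch of that behaviour, assuming pandas 1.3 or later (where `on_bad_lines` is available) and a made-up three-line input in place of the real file:

```python
import io

import pandas as pd

# Made-up input: the first line sets the expected width (4 fields),
# the second line has one field too many, the third has one too few.
raw = io.StringIO(
    "00:00:00.000:, ENTER, transId=1, building=110\n"
    "00:00:00.000:, ENTER, transId=2, building=110, taxnumber=110I1\n"
    "00:00:00.000:, ENTER, transId=3\n"
)

df = pd.read_csv(raw, header=None, engine="python", on_bad_lines="skip")
print(len(df))
print(df)
```

The record with the extra field is dropped, the short record is padded with NaN, and the surviving cells are still raw `key=value` strings in positional columns. For a file where missing fields should become NaN in named columns, this option is therefore likely to lose data.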
