How to load a .txt file with unaligned data into a DataFrame in Python

Posted 2024-09-30 08:36:15

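The data file in .txt format (df) is shown below. Some records are missing a few fields, and the missing fields should be left blank in the corresponding columns.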

For example, the .txt data file is

1,name=Messi,car=ford,Price=234,Bike=Harley  
2,name=Cavani,car=mazda,price=58,Bike=Ducatti  
3,name=Dembele,car=toyota,Bike=Yamaha        
4,name=kevin,car=Ford,price=989    
5,name=Aguero,Bike=Ducatti       
6,name=nadal,car=Ferrari,Bike=Harley

I want the file loaded into Python in the following format, with the corresponding column names:

Output_image

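The column names I want are Number, CARNAME, PRICE, BIKENAME. I want each value to be placed in the DataFrame under its corresponding column name. Empty values under any column should stay empty.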

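Because of formatting problems, I can't post the output image or type the output here. I'm new to Stack Overflow and don't have enough reputation to post images.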

Please note that my dataset has one million records.


2 Answers

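You can write the data to an intermediate CSV. Add a file modification time check so that the conversion only runs when the data text file has changed.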

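import csv
import pandas as pd
from pathlib import Path

header = ["Number", "CARNAME", "PRICE", "BIKENAME"]
# Keys are matched case-insensitively, because the data mixes "Price" and "price".
key_to_index = {"car": 1, "price": 2, "bike": 3}

def build_car_info_csv(in_fileobj, out_fileobj):
    """Convert the key=value text file into a regular CSV with fixed columns."""
    reader = csv.reader(in_fileobj)
    writer = csv.writer(out_fileobj)
    # Write the header row so pd.read_csv does not treat the first record as column names.
    writer.writerow(header)
    for row in reader:
        if not row:
            continue  # skip blank lines
        outrow = [''] * len(header)
        outrow[0] = row.pop(0).strip()
        for cell in row:
            key, _, val = cell.partition("=")
            try:
                # strip() removes the trailing spaces present in the sample data
                outrow[key_to_index[key.strip().lower()]] = val.strip()
            except KeyError:
                # ignore unwanted keys (e.g. "name")
                pass
        writer.writerow(outrow)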

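def read_car_info_df(filename):
    """Load the text file as a DataFrame, rebuilding the cached CSV only when needed."""
    filename = Path(filename)
    csv_filename = filename.with_suffix(".csv")
    mtime = filename.stat().st_mtime
    # A missing CSV counts as older than any source file.
    csv_mtime = csv_filename.stat().st_mtime if csv_filename.is_file() else 0
    if mtime > csv_mtime:
        # The source is newer than the cached CSV (or no CSV exists yet): convert again.
        with filename.open(newline="") as infile,\
                csv_filename.open("w", newline="") as outfile:
            build_car_info_csv(infile, outfile)
    return pd.read_csv(csv_filename)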

Test

with open("mytest.txt", "w") as f:
    f.write("""1,name=Messi,car=ford,Price=234,Bike=Harley
2,name=Cavani,car=mazda,price=58,Bike=Ducatti
3,name=Dembele,car=toyota,Bike=Yamaha
4,name=kevin,car=Ford,price=989
5,name=Aguero,Bike=Ducatti
6,name=nadal,car=Ferrari,Bike=Harley""")

df = read_car_info_df("mytest.txt")
print(df)
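
The question asks for missing fields to stay blank, while pd.read_csv turns empty cells into NaN by default. One possible variation, shown here only as a minimal sketch, is to read every column as text and pass keep_default_na=False. The csv_text string is a made-up stand-in for the intermediate CSV, not data from the original answer.

```python
import io
import pandas as pd

# Illustrative stand-in for the intermediate CSV (assumed contents).
csv_text = (
    "Number,CARNAME,PRICE,BIKENAME\n"
    "3,toyota,,Yamaha\n"
    "4,Ford,989,\n"
)

# Default behaviour: empty cells become NaN.
df_nan = pd.read_csv(io.StringIO(csv_text))

# Variation: keep every column as text and leave empty cells as "".
df_blank = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(df_blank)
```

The trade-off is that PRICE stays a string column. If numeric prices are needed later, pd.to_numeric(df_blank["PRICE"], errors="coerce") would convert it, at the cost of bringing NaN back for the blanks.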

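It is unlikely that an efficient library exists specifically for this kind of non-standard, non-uniform file format. So I would parse the file manually, line by line, into a list of dicts, where the missing keys (columns) can be handled by the DataFrame() constructor.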

Code:

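import pandas as pd

path_to_file = "/mnt/ramdisk/in.txt"
ls_dic = []
with open(path_to_file) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ls = line.split(",")
        dic = {}
        dic["Number"] = ls[0].strip()
        for k_v in ls[1:]:
            # split on the first "=" only, in case a value contains "="
            k, v = k_v.split("=", 1)
            # capitalize() maps both "Price" and "price" to the same column
            dic[k.strip().capitalize()] = v.strip()
        ls_dic.append(dic)

# Keys missing from a record become NaN in the corresponding column.
df = pd.DataFrame(ls_dic)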

Result:

print(df)

  Number     Name      Car Price     Bike
0      1    Messi     ford   234   Harley
1      2   Cavani    mazda    58  Ducatti
2      3  Dembele   toyota   NaN   Yamaha
3      4    kevin     Ford   989      NaN
4      5   Aguero      NaN   NaN  Ducatti
5      6    nadal  Ferrari   NaN   Harley
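
To get from this result to the column names requested in the question, with blanks instead of NaN, a possible follow-up step is sketched below. The wanted mapping and the two sample records are assumptions made for illustration, and dropping the Name column is a guess based on the column list in the question.

```python
import pandas as pd

# Small stand-in for the list of dicts built by the parsing loop (assumed sample).
ls_dic = [
    {"Number": "1", "Name": "Messi", "Car": "ford", "Price": "234", "Bike": "Harley"},
    {"Number": "5", "Name": "Aguero", "Bike": "Ducatti"},
]
df = pd.DataFrame(ls_dic)

# Hypothetical mapping from parsed keys to the requested column names.
wanted = {"Number": "Number", "Car": "CARNAME", "Price": "PRICE", "Bike": "BIKENAME"}

result = (
    df.reindex(columns=list(wanted))  # keep only the wanted columns, in this order
      .rename(columns=wanted)         # apply the requested names
      .fillna("")                     # show missing fields as empty strings
)
print(result)
```

Using reindex rather than plain column selection means a column that never appears in the file is still created (filled with blanks) instead of raising a KeyError.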
