Converting column key/value pairs into multiple rows with a DataFrame

Posted 2024-09-27 19:23:38


CSV file (sample1.csv):

id;name;desc;hist
1;Fulano;A;action: Test1\ndate: 04/10/2021 09:00:00\n\naction: Test2\ndate: 04/10/2021 09:00:00\n\naction: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n
2;Ciclano;B;action: Test1\ndate: 03/02/2021 14:23:24\n\naction: Test2\ndate: 03/02/2021 14:23:24\n\naction: Test3\ndate: 03/02/2021 14:23:24\nauto: TESTE\n\n
3;Beltrano;C;action: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n

I want to convert the "hist" column of the CSV into the following output:

|id|name    |desc|action|date                |auto
|1 |Fulano  |A   |Test1 |04/10/2021 09:00:00 |
|1 |Fulano  |A   |Test2 |04/10/2021 09:00:00 |
|1 |Fulano  |A   |Test3 |04/10/2021 09:00:00 |TESTE
|2 |Ciclano |B   |Test1 |03/02/2021 14:23:24 |
|2 |Ciclano |B   |Test2 |03/02/2021 14:23:24 |
|2 |Ciclano |B   |Test3 |03/02/2021 14:23:24 |TESTE
|3 |Beltrano|C   |Test3 |04/10/2021 09:00:00 |TESTE

I have already read the CSV into a DataFrame, but I don't know how to transform it. Can anyone help?


2 answers

You can parse the CSV file line by line and extract the last three fields (action, date, auto):

import io
import pandas as pd

csv = """
id;name;desc;hist
1;Fulano;A;action: Test1\ndate: 04/10/2021 09:00:00\n\naction: Test2\ndate: 04/10/2021 09:00:00\n\naction: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n
2;Ciclano;B;action: Test1\ndate: 03/02/2021 14:23:24\n\naction: Test2\ndate: 03/02/2021 14:23:24\n\naction: Test3\ndate: 03/02/2021 14:23:24\nauto: TESTE\n\n
3;Beltrano;C;action: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n
"""

header = 'id,name,desc,action,date,auto'
new_csv = [header]

with io.StringIO(csv) as ftx:
    lines = ftx.readlines()

first_row = None
for index, line in enumerate(lines):
    if 'id;name;desc;hist' in line:
        first_row = index + 1
        break

if first_row:
    index = first_row
    end = len(lines)
    while index < end:
        
        if ';' in lines[index]:
            col_id, col_name, col_desc, hist = lines[index].split(';')
            parse = hist.split(': ')
            action, date, auto = ['', '', '']
            
            while True:
                
                if parse[0].strip() == 'action':
                    action = parse[1].strip()
                elif parse[0].strip() == 'date':
                    date = parse[1].strip()
                elif parse[0].strip() == 'auto':
                    auto = parse[1].strip()

                index += 1
                
                if index >= end:
                    new_csv.append(','.join([col_id, col_name, col_desc, action, date, auto]))
                    break
                    
                if lines[index].strip() == '':
                    new_csv.append(','.join([col_id, col_name, col_desc, action, date, auto]))
                    index += 1
                    
                    if index >= end or lines[index].strip() == '':
                        index += 1
                        break
                
                parse = lines[index].split(': ')
        else:
            # Skip lines that do not start a record, so the outer loop cannot stall
            index += 1
                
new_csv = '\n'.join(new_csv)

df = pd.read_csv(io.StringIO(new_csv), parse_dates=['date'], keep_default_na=False)

df

   id      name  desc action                date   auto
0   1    Fulano    A  Test1  2021-04-10 09:00:00       
1   1    Fulano    A  Test2  2021-04-10 09:00:00       
2   1    Fulano    A  Test3  2021-04-10 09:00:00  TESTE
3   2   Ciclano    B  Test1  2021-03-02 14:23:24       
4   2   Ciclano    B  Test2  2021-03-02 14:23:24       
5   2   Ciclano    B  Test3  2021-03-02 14:23:24  TESTE
6   3  Beltrano    C  Test3  2021-04-10 09:00:00  TESTE
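
Note that the code above parses a string embedded in the script, where Python turns every `\n` escape into a real line break. If sample1.csv instead stores `\n` as the two literal characters backslash and "n" (an assumption about the file, not something the question confirms), the text would need one extra step before `readlines()`. A minimal sketch, with `raw` as a hypothetical stand-in for the file contents:

```python
import io

# Assumption: the file stores "\" followed by "n", not real line breaks.
# `raw` is a hypothetical stand-in for open("sample1.csv").read().
raw = "id;name;desc;hist\n1;Fulano;A;action: Test1\\ndate: 04/10/2021 09:00:00\\n\\n\n"

# Turn the literal "\n" sequences into real newlines before splitting into lines
text = raw.replace("\\n", "\n")

with io.StringIO(text) as ftx:
    lines = ftx.readlines()

print(lines)
```

After this step `lines` has the same shape the parsing loop expects: one key/value pair per line, with blank lines separating the entries.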

I managed to do it using only pandas and one helper function. Here is how I did it:

import pandas as pd

def list2Dict(l):
    d = dict()
    for x in l:
        # Split on the first ": " only, so values containing ": " stay intact
        k, v = x.split(": ", 1)
        d[k] = v
    return d

df = pd.read_csv("sample1.csv", sep=";", encoding="utf-8")
df["hist"] = df["hist"].str.replace(r'\\n','|', regex=True)

#Remove empty elements
df["hist"] = df["hist"].str.split(r'\|\|').apply(lambda x: [e for e in x if len(e) > 0 ])

#Splits only what is needed to parse
df["hist"] = df["hist"].apply(lambda x: [list(e.split("|")) for e in x])

#Convert each list of "key: value" strings to a dict
df["hist"] = df["hist"].apply(lambda x: [list2Dict(l) for l in x])

#Explode JSON column
exploded_df = df.explode('hist')

#Uses JSON normalize to create new columns
final_df = pd.merge(exploded_df.reset_index(drop=True), pd.json_normalize(exploded_df["hist"]), left_index=True, right_index=True).drop("hist", axis=1)

final_df.head(10)

   id      name desc action                 date   auto
0   1    Fulano    A  Test1  04/10/2021 09:00:00    NaN
1   1    Fulano    A  Test2  04/10/2021 09:00:00    NaN
2   1    Fulano    A  Test3  04/10/2021 09:00:00  TESTE
3   2   Ciclano    B  Test1  03/02/2021 14:23:24    NaN
4   2   Ciclano    B  Test2  03/02/2021 14:23:24    NaN
5   2   Ciclano    B  Test3  03/02/2021 14:23:24  TESTE
6   3  Beltrano    C  Test3  04/10/2021 09:00:00  TESTE
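
The `date` column in this output is still plain strings. A minimal sketch of converting it, assuming the dates are day-first (dd/mm/yyyy), which the question does not state; for month-first data the format string would be `%m/%d/%Y %H:%M:%S` instead. The small frame below is a hypothetical stand-in for `final_df`:

```python
import pandas as pd

# Hypothetical stand-in for the "date" column of final_df
final_df = pd.DataFrame({"date": ["04/10/2021 09:00:00", "03/02/2021 14:23:24"]})

# Assumption: day-first dates (dd/mm/yyyy); an explicit format avoids guessing
final_df["date"] = pd.to_datetime(final_df["date"], format="%d/%m/%Y %H:%M:%S")

print(final_df.dtypes)
```

Passing an explicit `format` keeps an ambiguous value such as 04/10/2021 from being silently read with the wrong day/month order.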

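The key steps in this answer are exploding the column that holds the lists of dicts, and then using json_normalize to parse the resulting dicts into columns automatically.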
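
Those two steps can be seen in isolation on a toy frame. This is only a sketch: the data below is hypothetical and simply mimics the shape of the parsed "hist" column (a list of dicts per row).

```python
import pandas as pd

# Hypothetical toy data: each row holds a list of dicts
df = pd.DataFrame({
    "id": [1, 2],
    "hist": [
        [{"action": "Test1"}, {"action": "Test2", "auto": "TESTE"}],
        [{"action": "Test3"}],
    ],
})

# explode() emits one row per list element and repeats the other columns
exploded = df.explode("hist").reset_index(drop=True)

# json_normalize() builds one row per dict, with one column per key;
# keys missing from a dict become NaN
parsed = pd.json_normalize(exploded["hist"].tolist())

# Both frames share the same 0..n-1 index, so a plain join lines them up
result = exploded.drop(columns="hist").join(parsed)
print(result)
```

Resetting the index after `explode` matters: the exploded frame otherwise keeps repeated index labels, which would not align with the fresh index that `json_normalize` produces.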

Thanks
