Converting key/value pairs in a column into multiple rows with a DataFrame

Sample input:

id;name;desc;hist
1;Fulano;A;action: Test1\ndate: 04/10/2021 09:00:00\n\naction: Test2\ndate: 04/10/2021 09:00:00\n\naction: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n
2;Ciclano;B;action: Test1\ndate: 03/02/2021 14:23:24\n\naction: Test2\ndate: 03/02/2021 14:23:24\n\naction: Test3\ndate: 03/02/2021 14:23:24\nauto: TESTE\n\n
3;Beltrano;C;action: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n

Desired output:

|id|name    |desc|action|date               |auto |
|1 |Fulano  |A   |Test1 |04/10/2021 09:00:00|     |
|1 |Fulano  |A   |Test1 |04/10/2021 09:00:00|     |
|1 |Fulano  |A   |Test1 |04/10/2021 09:00:00|TESTE|
|2 |Ciclano |B   |TEST3 |03/02/2021 14:23:24|     |
|2 |Ciclano |B   |TEST3 |03/02/2021 14:23:24|     |
|2 |Ciclano |B   |TEST3 |03/02/2021 14:23:24|TESTE|
|3 |Beltrano|C   |TEST2 |04/02/2021 14:23:24|TESTE|

2 Answers

User

Answer 1 · edited 2024-09-27 19:23:38

You can parse the CSV file line by line and extract the last three fields:

import io
import pandas as pd

csv = """
id;name;desc;hist
1;Fulano;A;action: Test1\ndate: 04/10/2021 09:00:00\n\naction: Test2\ndate: 04/10/2021 09:00:00\n\naction: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n
2;Ciclano;B;action: Test1\ndate: 03/02/2021 14:23:24\n\naction: Test2\ndate: 03/02/2021 14:23:24\n\naction: Test3\ndate: 03/02/2021 14:23:24\nauto: TESTE\n\n
3;Beltrano;C;action: Test3\ndate: 04/10/2021 09:00:00\nauto: TESTE\n\n
"""

header = 'id,name,desc,action,date,auto'
new_csv = [header]

with io.StringIO(csv) as ftx:
    lines = ftx.readlines()

first_row = None
for index, line in enumerate(lines):
    if 'id;name;desc;hist' in line:
        first_row = index + 1
        break

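if first_row:
    index = first_row
    end = len(lines)
    while index < end:

        if ';' in lines[index]:
            # A line with ';' starts a new record: id;name;desc;<first hist line>
            col_id, col_name, col_desc, hist = lines[index].split(';')
            parse = hist.split(': ')
            action, date, auto = ['', '', '']

            while True:
                # Store the value of the current "key: value" line
                if parse[0].strip() == 'action':
                    action = parse[1].strip()
                elif parse[0].strip() == 'date':
                    date = parse[1].strip()
                elif parse[0].strip() == 'auto':
                    auto = parse[1].strip()

                index += 1

                if index >= end:
                    # End of input: emit the pending row
                    new_csv.append(','.join([col_id, col_name, col_desc, action, date, auto]))
                    break

                if lines[index].strip() == '':
                    # A blank line closes one action/date/auto group
                    new_csv.append(','.join([col_id, col_name, col_desc, action, date, auto]))
                    # Reset so the next group does not inherit old values
                    action, date, auto = ['', '', '']
                    index += 1

                    # A second blank line (or end of input) closes the whole record
                    if index >= end or lines[index].strip() == '':
                        index += 1
                        break

                parse = lines[index].split(': ')
        else:
            # Skip lines that do not start a record; without this the loop never advances
            index += 1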
                
new_csv = '\n'.join(new_csv)

df = pd.read_csv(io.StringIO(new_csv), parse_dates=['date'], keep_default_na=False)

df

   id      name  desc action                date   auto
0   1    Fulano    A  Test1  2021-04-10 09:00:00       
1   1    Fulano    A  Test2  2021-04-10 09:00:00       
2   1    Fulano    A  Test3  2021-04-10 09:00:00  TESTE
3   2   Ciclano    B  Test1  2021-03-02 14:23:24       
4   2   Ciclano    B  Test2  2021-03-02 14:23:24       
5   2   Ciclano    B  Test3  2021-03-02 14:23:24  TESTE
6   3  Beltrano    C  Test3  2021-04-10 09:00:00  TESTE
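
One thing worth checking in the output above: `parse_dates=['date']` read `04/10/2021` as April 10 (month first). If the source dates are actually day-first, which is only an assumption here and not something the question states, a minimal sketch like the following parses them with an explicit format instead of letting pandas guess:

```python
import pandas as pd

# Two sample strings taken from the data above
raw = pd.Series(["04/10/2021 09:00:00", "03/02/2021 14:23:24"])

# Assumption: the source uses day/month/year. An explicit format avoids guessing.
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")

# For comparison: month/day/year, which matches the output shown above
month_first = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")

print(day_first.dt.strftime("%Y-%m-%d").tolist())    # ['2021-10-04', '2021-02-03']
print(month_first.dt.strftime("%Y-%m-%d").tolist())  # ['2021-04-10', '2021-03-02']
```

With `read_csv`, one option is to drop `parse_dates` and convert the `date` column afterwards in the same way.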

User

Answer 2 · edited 2024-09-27 19:23:38

I got this working with only pandas and one helper function. Here is how I did it:

import pandas as pd

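def list2Dict(l):
    # Turn a list of "key: value" strings into a dict
    d = dict()
    for x in l:
        # Split on the first ": " only, so a value containing ": " stays intact
        k, v = x.split(": ", 1)
        d[k] = v
    return d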

df = pd.read_csv("sample1.csv", sep=";", encoding="utf-8")
df["hist"] = df["hist"].str.replace(r'\\n','|', regex=True)

# Split into one string per record and drop empty elements
df["hist"] = df["hist"].str.split(r'\|\|').apply(lambda x: [e for e in x if len(e) > 0])

# Split each record into its "key: value" parts
df["hist"] = df["hist"].apply(lambda x: [e.split("|") for e in x])

# Convert each list of "key: value" strings to a dict
df["hist"] = df["hist"].apply(lambda x: [list2Dict(l) for l in x])

# Explode the list column: one row per dict
exploded_df = df.explode('hist')

# Use json_normalize to turn the dict keys into new columns
# (.tolist() drops the repeated index left over from explode)
final_df = pd.merge(exploded_df.reset_index(drop=True), pd.json_normalize(exploded_df["hist"].tolist()), left_index=True, right_index=True).drop("hist", axis=1)

final_df.head(10)

   id      name desc action                 date   auto
0   1    Fulano    A  Test1  04/10/2021 09:00:00    NaN
1   1    Fulano    A  Test2  04/10/2021 09:00:00    NaN
2   1    Fulano    A  Test3  04/10/2021 09:00:00  TESTE
3   2   Ciclano    B  Test1  03/02/2021 14:23:24    NaN
4   2   Ciclano    B  Test2  03/02/2021 14:23:24    NaN
5   2   Ciclano    B  Test3  03/02/2021 14:23:24  TESTE
6   3  Beltrano    C  Test3  04/10/2021 09:00:00  TESTE

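The two steps that really matter here are exploding the column that holds the lists of dicts, and then using json_normalize to parse the result into columns automatically.

Thanks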
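
To see those two steps in isolation, here is a minimal sketch on a made-up miniature frame (the data and the names `exploded`, `parsed`, and `result` are illustrative, not from the original answer):

```python
import pandas as pd

# Toy frame: each row holds a list of dicts in "hist"
df = pd.DataFrame({
    "id": [1, 2],
    "hist": [
        [{"action": "Test1"}, {"action": "Test2", "auto": "TESTE"}],
        [{"action": "Test3"}],
    ],
})

# explode: one output row per list element; reset the repeated index afterwards
exploded = df.explode("hist").reset_index(drop=True)

# json_normalize: dict keys become columns; keys missing from a dict become NaN
parsed = pd.json_normalize(exploded["hist"].tolist())

# Both frames now share a clean 0..n-1 index, so a plain join lines them up
result = exploded.drop(columns="hist").join(parsed)
print(result)
```

Resetting the index before joining matters: `explode` repeats the original row label for every list element, so joining on the unreset index would pair rows incorrectly.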
