How do I explode a column containing multiple (dict-like) JSON objects per row in pandas?


I have a CSV (tab-separated) file with several columns. The column of interest contains multiple JSON objects in each row. It looks like this:

IN: df = pd.read_csv('filename.tsv', sep='\t')
IN: df

OUT: name RSN model version dt  si2 si3 pi1 wi20    wi28    li1 ci1 ai1 ai2 ai3 ad1 wi19    wi27    wan2    wan1    li3 li2 li5 li4 li7 li6 li9 li8 wi22    wi21    wi24    wi23    wi26    wi25    wi30    wi29    wi14    wi13    wi16    wi15    wi17    wi18
   0    DE1 RSN JCO4032 R2.15   12-03-21 06:53:32:155   14  46  831 5   149 2   0   NaN NaN NaN NaN 0   0   218419  553198  1754335 32208167    18594   28750   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   NaN NaN NaN 
   1    DE1 RSN JCO4032 R2.15   12-03-21 06:54:04:343   14  46  863 5   149 2   0   NaN NaN NaN NaN 0   0   9063    209 99335   1941734 1084    1598    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   NaN NaN NaN 
   2    DE1 RSN JCO4032 R2.15   12-03-21 07:04:07:579   13  46  1469    5   149 2   0   NaN NaN NaN NaN 0   0   152680  18355   1656295 29541773    17201   25804   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   NaN NaN NaN 

IN: df.wi17
OUT: 
    35                                                  NaN
    36                                                  NaN
    37    [{"mac":"2xx01:xxF","rssi":-60,"txrate...
    38    [{"mac":"20:4xx:1F","rssi":-60,"txrate...
    39                                                  NaN
Name: wi17, dtype: object

IN: df.wi17[37]
OUT: '[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]'

I use json.loads to convert this column of strings into a column of dictionaries:

import json
import numpy as np

def parser2(d):
    # NaN is the only value for which d != d, so this skips missing cells
    if d != d:
        return np.nan
    else:
        return json.loads(d)

df.wi17 = df.wi17.apply(parser2)
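
As an aside, the same conversion can be written as a one-liner using pd.notna for the NaN check (a sketch, not part of the original code):

import json
import numpy as np
import pandas as pd

# NaN entries stay NaN; everything else is parsed into a list of dicts
df.wi17 = df.wi17.map(lambda d: json.loads(d) if pd.notna(d) else np.nan)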

I am looking for an elegant solution to explode these dictionaries into columns, grouping them by a unique 'mac' and then by a unique 'RSN' from the original df.

It should look something like this:

... RSN         .... mac        rssi  txrate  max_txrate  txbytes  rxbytes   nxn  ...
... RSNFDXXXKDF ... 2A:xxxx:sd   30   34      50          2323     34323     1x1  ...
... RSNFDXXXKDF ... 2A:xxxx:sd   50   84      70          20       2334343   1x1  ...
... RSNFDXXXKDF ... 3B:yyyy:sd   45   48      47          40       2334      2x2  ...
... RSNFDXXXKDF ... NaN         NaN   NaN     NaN         NaN      NaN       NaN  ...
... ADKNCCJXKDF ... AA:yyyy:sd   45   48      47          40       2334      2x2  ...

Any suggestions?


1 Answer

Let's use df.explode() on the data in column wi17, then pd.concat() together with df.apply() + pd.Series():

df2 = df.explode('wi17')   # one row per dict in each wi17 list
df3 = pd.concat([df2.drop('wi17', axis=1),
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                 axis=1).reset_index()

Here df2.apply(lambda x: pd.Series(x.wi17), axis=1) runs on df2, in which each list of dictionaries has already been exploded into one dictionary per row. Wrapping each dictionary in pd.Series expands its keys and values into column labels and column values.
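
To make the mechanics concrete, here is a tiny self-contained sketch of what df.explode() does to a list-valued column (toy data, not from the original post):

import pandas as pd

toy = pd.DataFrame({'k': ['a', 'b'], 'v': [[1, 2], [3]]})
print(toy.explode('v'))
#    k  v
# 0  a  1
# 0  a  2   <- one output row per list element; the index repeats
# 1  b  3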

Demo run

Test data construction

data = {'name': ['DE1', 'DE2', 'DE3'], 'RSN': ['RSNJCO4032', 'RSNJCO4033', 'RSNJCO4034']}
df = pd.DataFrame(data)
df['wi17'] = ['[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]', '[{"mac":"40:17xx:1F","rssi":-62,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"F8xx:B0","rssi":-58,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]', '[{"mac":"60:07xx:1F","rssi":-64,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"A8xx:C0","rssi":-61,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]']

import json
import numpy as np

def parser2(d):
    if d != d:          # NaN check: only NaN is unequal to itself
        return np.nan
    else:
        return json.loads(d)

df.wi17 = df.wi17.apply(parser2)

print(df)

  name         RSN                                                                                                                                                                                                                                                wi17
0  DE1  RSNJCO4032  [{'mac': '20:47xx:1F', 'rssi': -60, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'E8xx:A0', 'rssi': -57, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
1  DE2  RSNJCO4033  [{'mac': '40:17xx:1F', 'rssi': -62, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'F8xx:B0', 'rssi': -58, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
2  DE3  RSNJCO4034  [{'mac': '60:07xx:1F', 'rssi': -64, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'A8xx:C0', 'rssi': -61, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]

Running the new code

df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                 axis=1).reset_index()
print(df3)

Output:

  name         RSN         mac  rssi  txrate  max_txrate     txbytes    rxbytes  nxn
0  DE1  RSNJCO4032  20:47xx:1F   -60    72.0        72.0           0          0  1x1
1  DE1  RSNJCO4032     E8xx:A0   -57    72.0        72.0  1414810891  808725830  1x1
2  DE2  RSNJCO4033  40:17xx:1F   -62    72.0        72.0           0          0  1x1
3  DE2  RSNJCO4033     F8xx:B0   -58    72.0        72.0  1414810891  808725830  1x1
4  DE3  RSNJCO4034  60:07xx:1F   -64    72.0        72.0           0          0  1x1
5  DE3  RSNJCO4034     A8xx:C0   -61    72.0        72.0  1414810891  808725830  1x1

Edit

For better system performance (execution time), you can try replacing the .apply() call with list(map(...)) over pd.Series, as follows:

df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
                 axis=1).reset_index()

Edit 2

System performance (execution time) can be tuned further. Benchmarking shows that expanding the JSON structures into a new DataFrame with pd.json_normalize() and merging it back into the original DataFrame cuts the execution time by a factor of more than 20. Note the reset_index(drop=True) on df2: pd.json_normalize() returns a DataFrame with a fresh RangeIndex, so df2's duplicated post-explode index has to be dropped for the two frames to align by position.

df2 = df.explode('wi17')
df2['wi17'] = df2['wi17'].fillna({i: {} for i in df2.index})  # as suggested by @TwerkingPanda to handle NaN entries. 
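# Note: Series.fillna() interprets a dict argument as an index -> value
# mapping (a bare {} cannot be passed as the fill value), so mapping every
# index to {} puts an empty dict into each NaN slot.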
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True), 
                 pd.json_normalize(df2['wi17'])],
                 axis=1).reset_index()

Testing system performance with 30,000 rows (2 JSON objects per row, hence 60,000 JSON objects in total):

df1 = pd.concat([df] * 10000, ignore_index=True)
df2 = df1.explode('wi17')

(1) Benchmark using .apply() with pd.Series():

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                 axis=1).reset_index()
21.9 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(2) Benchmark of the revised version using list(map(...)) with pd.Series() and pd.DataFrame():

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
                 axis=1).reset_index()
20.6 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(3) Benchmark of the revised version using pd.json_normalize():

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True), 
                 pd.json_normalize(df2['wi17'])],
                 axis=1).reset_index()
999 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit 3

This version improves system performance (execution time) even further, while also preserving the OP's original Series index to ensure data integrity.

Since the JSON objects sit at a single level with no nesting, we can use the more efficient DataFrame constructor pd.DataFrame() to expand the JSON fields into columns, as follows:

df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
                 axis=1)

Benchmark of the version using pd.DataFrame() on df2['wi17'].to_list():

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
                 axis=1)
116 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This version is more than 8x faster than the pd.json_normalize() version, and more than 180x faster than the df.apply() + pd.Series() version. Moreover, the index= parameter of pd.DataFrame() gives the flexibility to keep the original DataFrame's index on the expanded columns.
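
As for the grouping asked about in the question: the desired output keeps one row per reading, so ordering by RSN and then mac is enough, and a real groupby is only needed if the readings should be aggregated. A minimal sketch (the choice of mean aggregation is an assumption, not from the question):

# One row per reading, ordered by device serial, then client MAC.
df4 = df3.sort_values(['RSN', 'mac'], na_position='last')

# Or aggregate the numeric readings per (RSN, mac) pair (assumed statistic).
stats = (df3.groupby(['RSN', 'mac'], dropna=False)
            [['rssi', 'txrate', 'max_txrate', 'txbytes', 'rxbytes']]
            .mean())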
