Using a multi-dimensional index on a MultiIndex DataFrame?

                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738

[11742280 rows x 1 columns]

                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737

[2006548 rows x 1 columns]

>>> import awkward as ak
>>> import pandas as pd
>>> p_z = ak.Array([
...     [ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  0.338738, 0.636035],
...     [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, -0.459227],
... ])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)

2 Answers

User

#1 · Edited 2024-09-25 00:27:06

IIUC:

Input data:

>>> p_z
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284

>>> tofpid
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7

Create a new MultiIndex from the (entry, tofpid) pairs of the second DataFrame:

mi = pd.MultiIndex.from_frame(tofpid.reset_index(level='subentry', drop=True)
                                    .reset_index())

Output:

>>> p_z.loc[mi.intersection(p_z.index)]
              p_z
entry
0     0  0.338738
      2 -0.307365
      4  0.243284
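The approach above can be reproduced without Awkward installed. The following is a minimal sketch that assumes hand-built stand-ins for the two frames (the five-row excerpts shown under "Input data", constructed directly with pandas rather than via ak.to_pandas). It shows that the intersection step silently drops (entry, tofpid) pairs that have no matching row in p_z, such as (0, 5) and (0, 7) in this excerpt.

```python
import pandas as pd

# Assumption: hand-built stand-ins for the frames that ak.to_pandas would produce,
# limited to the five rows shown in the "Input data" excerpt.
index = pd.MultiIndex.from_product([[0], range(5)], names=["entry", "subentry"])
p_z = pd.DataFrame(
    {"p_z": [0.338738, 0.636035, -0.307365, -0.167779, 0.243284]}, index=index
)
tofpid = pd.DataFrame({"tofpid": [0, 2, 4, 5, 7]}, index=index)

# Drop the 'subentry' level, move 'entry' into a column, and pair it with 'tofpid'.
mi = pd.MultiIndex.from_frame(
    tofpid.reset_index(level="subentry", drop=True).reset_index()
)

# Keep only the (entry, tofpid) pairs that actually exist in p_z's index.
selected = p_z.loc[mi.intersection(p_z.index)]
print(selected)
```

If a missing pair should raise an error instead of being dropped, p_z.loc[mi] without the intersection would do that.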

User

#2 · Edited 2024-09-25 00:27:06

Here is a reproducible example with enough structure to represent the problem (using the Awkward Array library):

>>> import awkward as ak
>>> 
>>> p_z = ak.Array([
...     [ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  0.338738, 0.636035],
...     [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, -0.459227],
... ])
>>> p_z
<Array [[0.339, 0.636, ... 0.156, -0.459]] type='2 * var * float64'>
>>> 
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid
<Array [[0, 2, 4, 5], [1, 2, 4]] type='2 * var * int64'>

As Pandas DataFrames, these are:

>>> df_p_z = ak.to_pandas(p_z)
>>> df_p_z
                  values
entry subentry          
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
      5         0.338738
      6         0.636035
1     0        -0.459227
      1         0.055993
      2        -0.469857
      3         0.192554
      4         0.155738
      5        -0.459227
>>> df_tofpid = ak.to_pandas(tofpid)
>>> df_tofpid
                values
entry subentry        
0     0              0
      1              2
      2              4
      3              5
1     0              1
      1              2
      2              4

As Awkward Arrays, what you want to do is slice the first array by the second. That is, you want p_z[tofpid]:

>>> p_z[tofpid]
<Array [[0.339, -0.307, ... -0.47, 0.156]] type='2 * var * float64'>
>>> p_z[tofpid].tolist()
[[0.338738, -0.307365, 0.243284, 0.338738], [0.055993, -0.469857, 0.155738]]
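To make the semantics of that slice concrete, here is a plain-Python sketch of what p_z[tofpid] computes: within each entry, the integers in tofpid pick elements out of the matching list in p_z. This only illustrates the result; it is not how Awkward Array implements it, and the helper name slice_by_index is hypothetical.

```python
# Plain nested lists standing in for the two Awkward Arrays.
p_z_lists = [
    [0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035],
    [-0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227],
]
tofpid_lists = [[0, 2, 4, 5], [1, 2, 4]]


def slice_by_index(values, indices):
    """For each entry, pick the elements of `values` at the positions in `indices`."""
    return [[row[i] for i in idx] for row, idx in zip(values, indices)]


print(slice_by_index(p_z_lists, tofpid_lists))
```

The printed lists match the p_z[tofpid].tolist() output above.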

Using Pandas, I managed to do this:

>>> df_p_z.loc[df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()]
                  values
entry subentry          
0     0         0.338738
      2        -0.307365
      4         0.243284
      5         0.338738
1     1         0.055993
      2        -0.469857
      4         0.155738

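What's happening here is that df_tofpid.reset_index(level=0) turns the "entry" part of the MultiIndex into a column. Then apply runs a Python function on each row (because axis=1), where x.values holds that row's (entry, values) pair, and tolist() turns the result into a list of tuples, like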

>>> df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()
[(0, 0), (0, 2), (0, 4), (0, 5), (1, 1), (1, 2), (1, 4)]

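which is what loc needs to select entry/subentry pairs from its MultiIndex.

My Pandas solution has two drawbacks: it's complicated, and it goes through Python iteration and objects, so it won't scale the way arrays do. It's very likely that a Pandas expert would find a better solution than mine; there's a lot I don't know about Pandas.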
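One possible way around the row-wise apply, offered as a sketch rather than as this answer's method: build the lookup keys with pd.MultiIndex.from_arrays from the "entry" level and the "values" column, which avoids creating a Python tuple per row. The frames below are hand-built stand-ins for df_p_z and df_tofpid (an assumption, so the snippet runs without Awkward installed), and the name keys is illustrative.

```python
import pandas as pd

# Assumption: hand-built equivalents of df_p_z and df_tofpid from the example above.
df_p_z = pd.DataFrame(
    {"values": [0.338738, 0.636035, -0.307365, -0.167779, 0.243284, 0.338738, 0.636035,
                -0.459227, 0.055993, -0.469857, 0.192554, 0.155738, -0.459227]},
    index=pd.MultiIndex.from_tuples(
        [(0, i) for i in range(7)] + [(1, i) for i in range(6)],
        names=["entry", "subentry"],
    ),
)
df_tofpid = pd.DataFrame(
    {"values": [0, 2, 4, 5, 1, 2, 4]},
    index=pd.MultiIndex.from_tuples(
        [(0, i) for i in range(4)] + [(1, i) for i in range(3)],
        names=["entry", "subentry"],
    ),
)

# Pair each row's 'entry' label with its tofpid value to form (entry, subentry) keys.
keys = pd.MultiIndex.from_arrays(
    [
        df_tofpid.index.get_level_values("entry").to_numpy(),
        df_tofpid["values"].to_numpy(),
    ],
    names=["entry", "subentry"],
)

# Look up all pairs at once; a pair missing from df_p_z raises KeyError.
result = df_p_z.loc[keys]
print(result)
```

On this sample data the selected rows are the same ones returned by the apply-based version. Whether it is actually faster on millions of rows has not been measured here.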
