Getting a summarized pivot DataFrame from row counts

Posted 2024-10-03 21:26:05


Given this DataFrame:

    bowl    cookie
0   one     chocolate
1   two     chocolate
2   two     chocolate
3   two     vanilla
4   one     vanilla
5   one     vanilla
6   one     vanilla
7   one     vanilla
8   one     vanilla
9   two     chocolate

I would like to get the following summary DataFrame:

        vanilla     chocolate
one     5           1
two     1           3

Other than doing it by hand:

vanilla_bowl1 = len(df_picks[(df_picks['bowl'] == 'one') & (df_picks['cookie'] == 'vanilla')])
vanilla_bowl2 = len(df_picks[(df_picks['bowl'] == 'two') & (df_picks['cookie'] == 'vanilla')])
chocolate_bowl1 = ...
chocolate_bowl2 = ...

Is there a way to do this in a single operation with pandas?


Note: I have looked at df.pivot(). It looks like this could work if I add a count column equal to 1 to every row:

    bowl    cookie      count
0   one     chocolate       1
1   two     chocolate       1
2   two     chocolate       1
3   two     vanilla         1
4   one     vanilla         1
5   one     vanilla         1
6   one     vanilla         1
7   one     vanilla         1
8   one     vanilla         1
9   two     chocolate       1

and then:

df.pivot(index='bowl', columns='cookie', values='count')

However, I would like to know whether there is a more direct way that does not require adding the count column first.


3 answers

A silly approach

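from itertools import product
import pandas as pd
import numpy as np

def pir(df):
    # Unique bowls and cookies become the row and column labels.
    ub = pd.Index(np.unique(df.values[:, 0]), name='bowl')
    uc = pd.Index(np.unique(df.values[:, 1]), name='cookie')
    # Every (bowl, cookie) combination, in row-major order.
    u = np.array(list(product(ub.values, uc.values)), dtype=object)
    # Compare each combination with each row: shape (combos, rows, 2).
    e = u[:, None] == df.values

    return pd.DataFrame(
        # Count the rows that match on both columns, then lay the counts
        # out as bowls x cookies.
        e.all(2).sum(1).reshape(len(ub), len(uc)),
        ub, uc
    )

pir(df)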
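
To see what the broadcasting comparison inside pir is doing, here is a small sketch with made-up arrays. The names combos and rows are hypothetical stand-ins for u and df.values, not part of the answer's code.

```python
import numpy as np

# Two hypothetical (bowl, cookie) combinations and three data rows.
combos = np.array([["one", "vanilla"], ["two", "vanilla"]], dtype=object)
rows = np.array(
    [["one", "vanilla"], ["two", "chocolate"], ["one", "vanilla"]],
    dtype=object,
)

# (2, 1, 2) compared with (3, 2) broadcasts to (2, 3, 2):
# every combination against every row, column by column.
eq = combos[:, None] == rows

# all(2): a row matches only if both columns match.
# sum(1): number of matching rows per combination.
counts = eq.all(2).sum(1)

print(eq.shape)  # (2, 3, 2)
print(counts)    # [2 0]
```

The intermediate array grows as combinations x rows x 2, which is probably a large part of why this approach is so much slower than the others in the timings.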


You can use the pivot_table() method:

In [33]: df.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
Out[33]:
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1
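
As a self-contained sketch, the frame from the question can be rebuilt and both routes compared. The variable names with_count, summed and sized are assumptions made for illustration. Note that plain pivot() raises on this data because each (bowl, cookie) pair occurs more than once, so the count-column route is sketched here with pivot_table(aggfunc="sum") instead.

```python
import pandas as pd

# The ten rows from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})

# Count-column route: every row contributes 1.
with_count = df.assign(count=1)

# pivot() only reshapes, so duplicate (bowl, cookie) pairs make it fail.
try:
    with_count.pivot(index="bowl", columns="cookie", values="count")
    pivot_raised = False
except ValueError:
    pivot_raised = True

# pivot_table() aggregates duplicates; summing the ones gives the counts.
summed = with_count.pivot_table(
    index="bowl", columns="cookie", values="count",
    aggfunc="sum", fill_value=0,
)

# Direct route: no helper column, just count rows per pair.
sized = df.pivot_table(
    index="bowl", columns="cookie", aggfunc="size", fill_value=0
)

print(pivot_raised)
print(summed)
print(sized)
```

Both tables should hold the same counts; the aggfunc="size" version simply skips the helper column.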

Alternatively, you can use groupby() + size() + unstack(), which is what pivot_table() does under the hood:

In [36]: df.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
Out[36]:
cookie  chocolate  vanilla
bowl
one             1        5
two             3        1
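
The chain can also be taken apart step by step. This is a minimal sketch that assumes the frame from the question; the partial frame, which drops the only (one, chocolate) row, is a hypothetical addition to show what fill_value=0 is for.

```python
import pandas as pd

# The ten rows from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})

# Step 1: one count per (bowl, cookie) pair, as a Series with a MultiIndex.
sizes = df.groupby(["bowl", "cookie"]).size()

# Step 2: move the 'cookie' level into the columns.
table = sizes.unstack("cookie", fill_value=0)

# Hypothetical variant: without row 0 there is no (one, chocolate) pair,
# so that cell would be NaN unless fill_value is given.
partial = (
    df.iloc[1:]
    .groupby(["bowl", "cookie"])
    .size()
    .unstack("cookie", fill_value=0)
)

print(sizes)
print(table)
print(partial)
```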

Timing on a 100,000-row DataFrame:

In [48]: big = pd.concat([df] * 10**4, ignore_index=True)

In [49]: big.shape
Out[49]: (100000, 2)

In [50]: %timeit pd.crosstab(big.bowl, big.cookie)
10 loops, best of 3: 58 ms per loop

In [51]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
10 loops, best of 3: 38.4 ms per loop

In [52]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
10 loops, best of 3: 34.2 ms per loop

In [118]: %timeit pir(big)
1 loop, best of 3: 631 ms per loop

In [119]: big.shape
Out[119]: (100000, 2)

Timing on a 1M-row DataFrame:

In [53]: big = pd.concat([big] * 10, ignore_index=True)

In [54]: big.shape
Out[54]: (1000000, 2)

In [55]: %timeit pd.crosstab(big.bowl, big.cookie)
1 loop, best of 3: 446 ms per loop

In [56]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0)
1 loop, best of 3: 333 ms per loop

In [57]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0)
1 loop, best of 3: 327 ms per loop

In [121]: %timeit pir(big)
1 loop, best of 3: 7.08 s per loop

In [122]: big.shape
Out[122]: (1000000, 2)
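
The %timeit runs above need IPython. Outside of it, a rough equivalent can be sketched with the standard-library timeit module. The candidates dictionary and the repeat counts below are assumptions chosen for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# The ten rows from the question, repeated to 100,000 rows.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})
big = pd.concat([df] * 10**4, ignore_index=True)

candidates = {
    "crosstab": lambda: pd.crosstab(big["bowl"], big["cookie"]),
    "pivot_table": lambda: big.pivot_table(
        index="bowl", columns="cookie", aggfunc="size", fill_value=0
    ),
    "groupby": lambda: big.groupby(["bowl", "cookie"])
    .size()
    .unstack("cookie", fill_value=0),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 5 calls each, reported as seconds per call.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    timings[name] = best
    print(f"{name}: {best * 1e3:.1f} ms per call")
```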

The most concise way is probably the pandas.crosstab function (here d is the DataFrame from the question):

>>> pandas.crosstab(d.bowl, d.cookie)
cookie  chocolate  vanilla
bowl                      
one             1        5
two             3        1
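
If totals or proportions are also wanted, crosstab can produce them through its margins and normalize parameters. This is a minimal sketch that assumes the frame from the question; the variable names are illustrative.

```python
import pandas as pd

# The ten rows from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})

# Plain counts, as in the answer above.
counts = pd.crosstab(df["bowl"], df["cookie"])

# margins=True appends an "All" row and column holding the totals.
with_totals = pd.crosstab(df["bowl"], df["cookie"], margins=True)

# normalize="index" turns each row into fractions that sum to 1.
shares = pd.crosstab(df["bowl"], df["cookie"], normalize="index")

print(counts)
print(with_totals)
print(shares)
```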
