如何将具有多级列和轴名称的透视表转换为行?

2024-09-27 04:30:06 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个下面提到的dataframe,它是使用^{}传递参数agg_func作为sum旋转dataframe的结果:

                                          counts           
RACE                   BLACK OR AFRICAN AMERICAN WHITE  All
ETHNIC                                                     
HISPANIC OR LATINO                            11    41   52
NOT HISPANIC OR LATINO                        15    71   86
All                                           26   112  138

您可以运行下面的代码将上面的数据帧加载到变量df

df = pd.DataFrame.from_dict({('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26}, ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112}, ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138}}).rename_axis((None, 'RACE'), axis=1).rename_axis((['ETHNIC']), axis=0)

我正在尝试转换此数据帧,如下所述:

预期输出:

  level varName                      value counts
0     2  ETHNIC         HISPANIC OR LATINO    52
1     1    RACE  BLACK OR AFRICAN AMERICAN    11
2     1    RACE                      WHITE    41
3     2  ETHNIC     NOT HISPANIC OR LATINO    86
4     1    RACE  BLACK OR AFRICAN AMERICAN    15
5     1    RACE                      WHITE    71

上面的varName字段表示列和行的轴名称,level是1表示列轴,2表示行轴,具有索引值的行All是可选的,因此不管它是否在结果数据帧中(后来我认为如果它在那里会更好)

我已经研究了以下这么多线程,但我没有发现它们中的任何一个与我的问题有多大关系

我能够转换的一种方法是手动拾取值并手动创建所需的数据帧,如下所示:

df.index.names
# output: FrozenList(['ETHNIC'])
df.columns.names
#output: FrozenList([None, 'RACE'])
[y for x,y in df][:-1]
#output: ['BLACK OR AFRICAN AMERICAN', 'WHITE']
[x for x in df.index][:-1]
#output: ['HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO']

编辑

以下是创建透视表之前的数据:

data = pd.DataFrame.from_dict({'ETHNIC': {0: 'NOT HISPANIC OR LATINO', 1: 'NOT HISPANIC OR LATINO', 2: 'NOT HISPANIC OR LATINO', 3: 'NOT HISPANIC OR LATINO', 4: 'NOT HISPANIC OR LATINO', 5: 'NOT HISPANIC OR LATINO', 6: 'NOT HISPANIC OR LATINO', 7: 'NOT HISPANIC OR LATINO', 8: 'NOT HISPANIC OR LATINO', 9: 'HISPANIC OR LATINO', 10: 'HISPANIC OR LATINO', 11: 'HISPANIC OR LATINO', 12: 'HISPANIC OR LATINO', 13: 'HISPANIC OR LATINO', 14: 'NOT HISPANIC OR LATINO', 15: 'NOT HISPANIC OR LATINO', 16: 'NOT HISPANIC OR LATINO', 17: 'NOT HISPANIC OR LATINO', 18: 'NOT HISPANIC OR LATINO', 19: 'NOT HISPANIC OR LATINO', 20: 'NOT HISPANIC OR LATINO', 21: 'HISPANIC OR LATINO', 22: 'HISPANIC OR LATINO', 23: 'NOT HISPANIC OR LATINO', 24: 'NOT HISPANIC OR LATINO', 25: 'NOT HISPANIC OR LATINO', 26: 'HISPANIC OR LATINO', 27: 'HISPANIC OR LATINO', 28: 'HISPANIC OR LATINO', 29: 'HISPANIC OR LATINO', 30: 'HISPANIC OR LATINO', 31: 'HISPANIC OR LATINO', 32: 'NOT HISPANIC OR LATINO', 33: 'HISPANIC OR LATINO', 34: 'NOT HISPANIC OR LATINO', 35: 'NOT HISPANIC OR LATINO', 36: 'NOT HISPANIC OR LATINO', 37: 'NOT HISPANIC OR LATINO', 38: 'NOT HISPANIC OR LATINO', 39: 'NOT HISPANIC OR LATINO', 40: 'NOT HISPANIC OR LATINO', 41: 'NOT HISPANIC OR LATINO', 42: 'HISPANIC OR LATINO', 43: 'NOT HISPANIC OR LATINO', 44: 'NOT HISPANIC OR LATINO', 45: 'NOT HISPANIC OR LATINO', 46: 'HISPANIC OR LATINO', 47: 'HISPANIC OR LATINO', 48: 'HISPANIC OR LATINO', 49: 'HISPANIC OR LATINO', 50: 'NOT HISPANIC OR LATINO', 51: 'NOT HISPANIC OR LATINO', 52: 'NOT HISPANIC OR LATINO', 53: 'HISPANIC OR LATINO', 54: 'HISPANIC OR LATINO', 55: 'HISPANIC OR LATINO', 56: 'NOT HISPANIC OR LATINO', 57: 'HISPANIC OR LATINO', 58: 'HISPANIC OR LATINO', 59: 'NOT HISPANIC OR LATINO', 60: 'NOT HISPANIC OR LATINO', 61: 'HISPANIC OR LATINO', 62: 'HISPANIC OR LATINO', 63: 'HISPANIC OR LATINO', 64: 'HISPANIC OR LATINO', 65: 'NOT HISPANIC OR LATINO', 66: 'NOT HISPANIC OR LATINO', 67: 'NOT HISPANIC OR LATINO', 68: 'NOT HISPANIC OR LATINO', 69: 'HISPANIC OR LATINO', 70: 'NOT HISPANIC OR LATINO', 71: 'NOT HISPANIC OR LATINO', 72: 'HISPANIC OR LATINO', 73: 'HISPANIC OR LATINO', 74: 'HISPANIC OR LATINO', 75: 'NOT HISPANIC OR LATINO', 76: 'NOT HISPANIC OR LATINO', 77: 'NOT HISPANIC OR LATINO', 78: 'NOT HISPANIC OR LATINO', 79: 'NOT HISPANIC OR LATINO', 80: 'NOT HISPANIC OR LATINO', 81: 'NOT HISPANIC OR LATINO', 82: 'HISPANIC OR LATINO', 83: 'HISPANIC OR LATINO', 84: 'HISPANIC OR LATINO', 85: 'NOT HISPANIC OR LATINO', 86: 'HISPANIC OR LATINO', 87: 'HISPANIC OR LATINO', 88: 'HISPANIC OR LATINO', 89: 'NOT HISPANIC OR LATINO', 90: 'NOT HISPANIC OR LATINO', 91: 'NOT HISPANIC OR LATINO', 92: 'NOT HISPANIC OR LATINO', 93: 'NOT HISPANIC OR LATINO', 94: 'NOT HISPANIC OR LATINO', 95: 'HISPANIC OR LATINO', 96: 'HISPANIC OR LATINO', 97: 'HISPANIC OR LATINO', 98: 'NOT HISPANIC OR LATINO', 99: 'NOT HISPANIC OR LATINO', 100: 'NOT HISPANIC OR LATINO', 101: 'NOT HISPANIC OR LATINO', 102: 'NOT HISPANIC OR LATINO', 103: 'NOT HISPANIC OR LATINO', 104: 'NOT HISPANIC OR LATINO', 105: 'NOT HISPANIC OR LATINO', 106: 'NOT HISPANIC OR LATINO', 107: 'NOT HISPANIC OR LATINO', 108: 'NOT HISPANIC OR LATINO', 109: 'HISPANIC OR LATINO', 110: 'HISPANIC OR LATINO', 111: 'NOT HISPANIC OR LATINO', 112: 'NOT HISPANIC OR LATINO', 113: 'NOT HISPANIC OR LATINO', 114: 'NOT HISPANIC OR LATINO', 115: 'HISPANIC OR LATINO', 116: 'HISPANIC OR LATINO', 117: 'NOT HISPANIC OR LATINO', 118: 'HISPANIC OR LATINO', 119: 'HISPANIC OR LATINO', 120: 'NOT HISPANIC OR LATINO', 121: 'HISPANIC OR LATINO', 122: 'HISPANIC OR LATINO', 123: 'HISPANIC OR LATINO', 124: 'HISPANIC OR LATINO', 125: 'HISPANIC OR LATINO', 126: 'NOT HISPANIC OR LATINO', 127: 'NOT HISPANIC OR LATINO', 128: 'NOT HISPANIC OR LATINO', 129: 'NOT HISPANIC OR LATINO', 130: 'NOT HISPANIC OR LATINO', 131: 'NOT HISPANIC OR LATINO', 132: 'NOT HISPANIC OR LATINO', 133: 'NOT HISPANIC OR LATINO', 134: 'NOT HISPANIC OR LATINO', 135: 'NOT HISPANIC OR LATINO', 136: 'NOT HISPANIC OR LATINO', 137: 'NOT HISPANIC OR LATINO'}, 'RACE': {0: 'WHITE', 1: 'WHITE', 2: 'WHITE', 3: 'WHITE', 4: 'WHITE', 5: 'WHITE', 6: 'WHITE', 7: 'WHITE', 8: 'WHITE', 9: 'BLACK OR AFRICAN AMERICAN', 10: 'BLACK OR AFRICAN AMERICAN', 11: 'BLACK OR AFRICAN AMERICAN', 12: 'BLACK OR AFRICAN AMERICAN', 13: 'BLACK OR AFRICAN AMERICAN', 14: 'WHITE', 15: 'WHITE', 16: 'WHITE', 17: 'WHITE', 18: 'WHITE', 19: 'WHITE', 20: 'BLACK OR AFRICAN AMERICAN', 21: 'WHITE', 22: 'WHITE', 23: 'WHITE', 24: 'BLACK OR AFRICAN AMERICAN', 25: 'BLACK OR AFRICAN AMERICAN', 26: 'WHITE', 27: 'WHITE', 28: 'WHITE', 29: 'WHITE', 30: 'WHITE', 31: 'WHITE', 32: 'WHITE', 33: 'WHITE', 34: 'WHITE', 35: 'WHITE', 36: 'WHITE', 37: 'WHITE', 38: 'WHITE', 39: 'WHITE', 40: 'BLACK OR AFRICAN AMERICAN', 41: 'BLACK OR AFRICAN AMERICAN', 42: 'WHITE', 43: 'WHITE', 44: 'WHITE', 45: 'WHITE', 46: 'WHITE', 47: 'WHITE', 48: 'WHITE', 49: 'WHITE', 50: 'WHITE', 51: 'BLACK OR AFRICAN AMERICAN', 52: 'BLACK OR AFRICAN AMERICAN', 53: 'WHITE', 54: 'WHITE', 55: 'WHITE', 56: 'WHITE', 57: 'WHITE', 58: 'WHITE', 59: 'WHITE', 60: 'WHITE', 61: 'WHITE', 62: 'WHITE', 63: 'WHITE', 64: 'WHITE', 65: 'BLACK OR AFRICAN AMERICAN', 66: 'BLACK OR AFRICAN AMERICAN', 67: 'BLACK OR AFRICAN AMERICAN', 68: 'BLACK OR AFRICAN AMERICAN', 69: 'WHITE', 70: 'WHITE', 71: 'WHITE', 72: 'WHITE', 73: 'WHITE', 74: 'BLACK OR AFRICAN AMERICAN', 75: 'WHITE', 76: 'WHITE', 77: 'WHITE', 78: 'WHITE', 79: 'WHITE', 80: 'BLACK OR AFRICAN AMERICAN', 81: 'BLACK OR AFRICAN AMERICAN', 82: 'BLACK OR AFRICAN AMERICAN', 83: 'BLACK OR AFRICAN AMERICAN', 84: 'BLACK OR AFRICAN AMERICAN', 85: 'BLACK OR AFRICAN AMERICAN', 86: 'WHITE', 87: 'WHITE', 88: 'WHITE', 89: 'WHITE', 90: 'WHITE', 91: 'WHITE', 92: 'WHITE', 93: 'WHITE', 94: 'WHITE', 95: 'WHITE', 96: 'WHITE', 97: 'WHITE', 98: 'WHITE', 99: 'WHITE', 100: 'WHITE', 101: 'WHITE', 102: 'WHITE', 103: 'WHITE', 104: 'WHITE', 105: 'WHITE', 106: 'WHITE', 107: 'WHITE', 108: 'BLACK OR AFRICAN AMERICAN', 109: 'WHITE', 110: 'WHITE', 111: 'WHITE', 112: 'WHITE', 113: 'WHITE', 114: 'WHITE', 115: 'BLACK OR AFRICAN AMERICAN', 116: 'BLACK OR AFRICAN AMERICAN', 117: 'WHITE', 118: 'WHITE', 119: 'WHITE', 120: 'WHITE', 121: 'WHITE', 122: 'WHITE', 123: 'WHITE', 124: 'WHITE', 125: 'WHITE', 126: 'WHITE', 127: 'WHITE', 128: 'WHITE', 129: 'WHITE', 130: 'WHITE', 131: 'WHITE', 132: 'WHITE', 133: 'WHITE', 134: 'WHITE', 135: 'WHITE', 136: 'WHITE', 137: 'WHITE'}})

以下是旋转的代码:

df = (data.groupby(['ETHNIC', 'RACE'])
      .size()
      .to_frame('counts')
      .reset_index(level=['ETHNIC', 'RACE'])
      .pivot_table(index='ETHNIC', columns='RACE', aggfunc='sum', margins=True, dropna=False)
      )

PS请注意,预期数据帧中的行顺序很重要

更新

正如在评论中所建议的,我尝试了pd.crosstab,发现它是 创建与我使用pd.pivot_table创建的聚合df相同的聚合df(在一个有200K行的数据帧上测试)的速度几乎慢了2倍


Tags: or数据dfindexnotallblackamerican
3条回答

我通过使用helper函数(使用原始数据)成功地使其工作:

def agg_data(g):
    df_race = (
        g.groupby('RACE').size().to_frame('count')
        .rename_axis(index='value').reset_index()
        .assign(level=1, varName='RACE')
        [['level', 'varName', 'value', 'count']]
    )
    
    df_ethnic = (
        pd.DataFrame([[2, 'ETHNIC', g.ETHNIC.iloc[0], len(g)]], columns=df_race.columns)
    )
    
    return pd.concat([df_ethnic, df_race])

df.groupby(['ETHNIC']).apply(agg_data).reset_index(drop=True)


    level   varName value                       count
0   2       ETHNIC  HISPANIC OR LATINO          52
1   1       RACE    BLACK OR AFRICAN AMERICAN   11
2   1       RACE    WHITE                       41
3   2       ETHNIC  NOT HISPANIC OR LATINO      86
4   1       RACE    BLACK OR AFRICAN AMERICAN   15
5   1       RACE    WHITE                       71

我的想法是:

df = (
    df.iloc[:-1][['All', *df.columns.difference(['All'])]]
        .stack()
        .reset_index(name='count')
        .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
)

m = df['varName'].ne(df['varName'].shift())
df['value'] = np.where(m, df['varName'], df['value'])
df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1

df = df[['level', 'varName', 'value', 'count']]

df

   level varName                      value  count
0      2  ETHNIC         HISPANIC OR LATINO     52
1      1    RACE  BLACK OR AFRICAN AMERICAN     11
2      1    RACE                      WHITE     41
3      2  ETHNIC     NOT HISPANIC OR LATINO     86
4      1    RACE  BLACK OR AFRICAN AMERICAN     15
5      1    RACE                      WHITE     71

首先去除底部边距并重新排序列:

df.iloc[:-1][['All', *df.columns.difference(['All'])]]
RACE                    All  BLACK OR AFRICAN AMERICAN  WHITE
ETHNIC                                                       
HISPANIC OR LATINO       52                         11     41
NOT HISPANIC OR LATINO   86                         15     71

然后stackrename

df.iloc[:-1][['All', *df.columns.difference(['All'])]]
        .stack()
        .reset_index(name='count')
        .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
                  varName                      value  count
0      HISPANIC OR LATINO                        All     52
1      HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     11
2      HISPANIC OR LATINO                      WHITE     41
3  NOT HISPANIC OR LATINO                        All     86
4  NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     15
5  NOT HISPANIC OR LATINO                      WHITE     71

然后,其余的基于布尔索引varName

m = df['varName'].ne(df['varName'].shift())
0     True
1    False
2    False
3     True
4    False
5    False
Name: varName, dtype: bool

varName移到value

df['value'] = np.where(m, df['varName'], df['value'])
                  varName                      value  count
0      HISPANIC OR LATINO         HISPANIC OR LATINO     52
1      HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     11
2      HISPANIC OR LATINO                      WHITE     41
3  NOT HISPANIC OR LATINO     NOT HISPANIC OR LATINO     86
4  NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     15
5  NOT HISPANIC OR LATINO                      WHITE     71

基于m分配levelvarName

df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1
  varName                      value  count  level
0  ETHNIC         HISPANIC OR LATINO     52      2
1    RACE  BLACK OR AFRICAN AMERICAN     11      1
2    RACE                      WHITE     41      1
3  ETHNIC     NOT HISPANIC OR LATINO     86      2
4    RACE  BLACK OR AFRICAN AMERICAN     15      1
5    RACE                      WHITE     71      1

最后,重新排列列:

df = df[['level', 'varName', 'value', 'count']]

我现在能想到的唯一方法是:

df_flat = pd.crosstab(data['ETHNIC'], data['RACE'])
l = []
for n,g in df_flat.stack().groupby(level=0):
    l.append(g.sum(level=0).rename('count').to_frame().assign(level=2, varname=g.index.names[0]))
    l.append(g.droplevel(level=0).rename('count').to_frame().assign(level=1, varname=g.index.names[1]))
df_out = pd.concat(l).reset_index()
df_out

输出:

                       index  count  level varname
0         HISPANIC OR LATINO     52      2  ETHNIC
1  BLACK OR AFRICAN AMERICAN     11      1    RACE
2                      WHITE     41      1    RACE
3     NOT HISPANIC OR LATINO     86      2  ETHNIC
4  BLACK OR AFRICAN AMERICAN     15      1    RACE
5                      WHITE     71      1    RACE

我们也可以从多索引中获取那些级别名称种族和种族

相关问题 更多 >

    热门问题