Calculated columns in a MultiIndex

Posted 2024-06-01 06:24:25


I want to insert 2 columns into a DataFrame.

Original DataFrame

card    auth       month   order_number
Amex     A        2017-11       1234
Visa     A        2017-12       2345
Amex     D        2017-12       3456

I want to break down the auth codes by month. I used the following code:

[code block not preserved in the source]

DataFrame by month

   month         2017-11      2017-12
    auth         A    D       A    D
    card
    mastercard  10    11     11    10
    amex        19    20     10    11
    visa        50    30     50    1
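A table of this shape can be produced from the raw rows with pivot_table. The following is a minimal sketch, not necessarily the exact code used here: it mirrors the pivot_table call in the first answer below and reuses that answer's sample rows. The fill_value=0 argument is an assumption, added so that card/month/auth combinations with no orders show up as 0 rather than NaN.

```python
import pandas as pd

# Sample rows borrowed from the first answer's test data (illustrative only).
raw = pd.DataFrame({
    "card": ["Amex", "Visa", "Amex", "MC", "Visa", "Amex", "Visa", "Amex", "Visa"],
    "auth": ["A", "A", "D", "A", "A", "D", "A", "D", "D"],
    "month": ["2017-11", "2017-12", "2017-12", "2017-12", "2017-11",
              "2017-12", "2017-11", "2017-12", "2017-11"],
    "order_number": [1234, 2345, 3416, 3426, 3436, 3446, 3466, 3476, 3486],
})

# Count orders per card (rows) and per month/auth pair (columns).
# fill_value=0 is an assumption: combinations with no orders become 0, not NaN.
by_month = raw.pivot_table(index="card", columns=["month", "auth"],
                           values="order_number", aggfunc="count", fill_value=0)
print(by_month)
```

With zeros in place of NaN, later sums and ratios over the month level stay defined for every card.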

Desired result

I want to add columns for the subtotal and the auth_rate.

       month                   2017-11                       2017-12
        auth         A    D   total    pct           A    D    total  pct
        card
        mastercard  10    11     21    .47           11    10   21    .52
        amex        19    20     39    .49           10    11   21    .47
        visa        50    30     80    .63           50    1    51    .98

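I am having trouble creating these columns. This link shows subtotals by row, but it does not carry over to columns or to calculated columns.

Any help is appreciated!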


2 answers

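I just tested on Pandas 0.17.0 / Python 2.7.5, and now I can see why you asked me about reindex(axis=1) and the '*' before df1.columns.levels[1]. It is indeed a version issue with Pandas and Python. I modified the code to run on the older versions mentioned above, and also fixed a potential bug in case several common descriptive statistics need to be post-calculated in the same pivot table. Going forward, it helps to mention the software versions in a post (if they are old ones), so that there is less room for misunderstanding: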

import pandas as pd

# sample data (named `data` to avoid shadowing the built-in `str`)
data = """card    auth   month   order_number
Amex     A        2017-11       1234
Visa     A        2017-12       2345
Amex     D        2017-12       3416
MC       A        2017-12       3426
Visa     A        2017-11       3436
Amex     D        2017-12       3446
Visa     A        2017-11       3466
Amex     D        2017-12       3476
Visa     D        2017-11       3486
"""

# create dataframe from the above sample data
# (pd.io.common.StringIO is what the old Pandas 0.17.0 / Python 2.7.5 setup provides)
df = pd.read_table(pd.io.common.StringIO(data), sep=r'\s+')

# create the pivot_table using the method OP supplied
df1 = df.pivot_table(index='card', columns=['month', 'auth'], values='order_number', aggfunc='count')
print(df1)
# month 2017-11      2017-12     
# auth        A    D       A    D
# card                           
# Amex      1.0  NaN     NaN  3.0
# MC        NaN  NaN     1.0  NaN
# Visa      2.0  1.0     1.0  NaN

# create an empty dataframe with the same index/column layout as df1
# except the level-1 in columns
idx = pd.MultiIndex.from_product([df1.columns.levels[0], ['total', 'avg', 'std', 'pct']], names=df1.columns.names)
df2 = pd.DataFrame(columns=idx, index=df1.index).sort_index(axis=1)

print(df2)
# month 2017-11                 2017-12                
# auth      avg  pct  std total     avg  pct  std total
# card                                                 
# Amex      NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN
# MC        NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN
# Visa      NaN  NaN  NaN   NaN     NaN  NaN  NaN   NaN

# Calculate the common stats:
df2.loc[:,(slice(None),'total')] = df1.groupby(level=0, axis=1).sum().values
df2.loc[:,(slice(None),'avg')]   = df1.groupby(level=0, axis=1).mean().values
df2.loc[:,(slice(None),'std')]   = df1.groupby(level=0, axis=1).std().values

# join df2 with df1 and assign the result to df3 (can also overwrite df1): 
df3 = df1.join(df2).sort_index(axis=1)

# calculate `pct` which needs both a calculated field and an original field
# auth-rate = A / total
df3.loc[:,(slice(None),'pct')] = df3.groupby(level=0, axis=1)\
                                    .apply(lambda x: x.loc[:,(slice(None),'A')].values/x.loc[:,(slice(None),'total')].values) \
                                    .values

print(df3)
# month 2017-11                                    2017-12                      
# auth        A   D  avg       pct       std total       A   D avg pct std total
# card                                                                          
# Amex        1 NaN  1.0  1.000000       NaN     1     NaN   3   3 NaN NaN     3
# MC        NaN NaN  NaN       NaN       NaN   NaN       1 NaN   1   1 NaN     1
# Visa        2   1  1.5  0.666667  0.707107     3       1 NaN   1   1 NaN     1

# rounding if needed:
df3.loc[:,(slice(None),'pct')] = df3.loc[:,(slice(None),'pct')].round(decimals=2)

If you want the level-1 columns in a specific order, you can use reindex().

[code block not preserved in the source]
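As a rough illustration of that reordering, here is a minimal sketch. It does not reproduce the answer's own snippet: it builds a small stand-in table, assumes the preferred order is A, D, total, pct, and passes a complete column MultiIndex to reindex() so that only the order changes.

```python
import pandas as pd

# Small stand-in for the final table (values taken from the example outputs on this page).
months = ["2017-11", "2017-12"]
cols = pd.MultiIndex.from_product([months, ["A", "D", "pct", "total"]],
                                  names=["month", "auth"])
df3 = pd.DataFrame(
    [[10, 11, 0.48, 21, 11, 10, 0.52, 21],
     [19, 20, 0.49, 39, 10, 11, 0.48, 21]],
    index=pd.Index(["mastercard", "amex"], name="card"),
    columns=cols,
)

# Desired order of the level-1 labels inside every month (an assumed preference).
wanted = ["A", "D", "total", "pct"]
new_cols = pd.MultiIndex.from_product([months, wanted], names=list(df3.columns.names))

# Every requested label already exists, so reindex() only reorders the columns.
ordered = df3.reindex(columns=new_cols)
print(ordered)
```

Any label in the new index that did not exist in the table would come back as an all-NaN column, so the list of wanted labels should match the existing ones exactly.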

Use:

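# create sum by first level of MultiIndex (df is the by-month pivot table)
# DataFrame.sum(level=...) was removed in pandas 2.0, so group the transposed frame instead
df1 = df.T.groupby(level=0).sum().T
df1.columns = [df1.columns, ['total'] * len(df1.columns)]
print(df1)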
# month      2017-11 2017-12
#              total   total
# card                      
# mastercard      21      21
# amex            39      21
# visa            80      51

#select by second level and divide
df2 = df.xs('A', axis=1, level=1).div(df1.xs('total', axis=1, level=1)).round(2)
df2.columns = [df2.columns, ['pct'] * len(df2.columns)]
print(df2)
# month      2017-11 2017-12
#                pct     pct
# card                      
# mastercard    0.48    0.52
# amex          0.49    0.48
# visa          0.62    0.98

#join all together, sort MultiIndex
df3 = pd.concat([df, df1, df2], axis=1).sort_index(axis=1)
print(df3)
# month      2017-11                 2017-12                
# auth             A   D   pct total       A   D   pct total
# card                                                      
# mastercard      10  11  0.48    21      11  10  0.52    21
# amex            19  20  0.49    39      10  11  0.48    21
# visa            50  30  0.62    80      50   1  0.98    51
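The three steps above (sum per month, divide the A column by the total, join and sort) can be wrapped into one function. This is only a sketch under stated assumptions: add_total_and_pct is a hypothetical helper name, the month is assumed to sit on column level 0 and the auth code on level 1, and the input is the by-month table from the question.

```python
import pandas as pd

def add_total_and_pct(pivot, numerator="A"):
    """Return pivot plus per-month 'total' and 'pct' (= numerator / total) columns.

    Hypothetical helper: the name and signature are not from the answers above.
    Assumes two-level columns with the month on level 0 and the auth code on level 1.
    """
    # Sum the auth codes within each month by grouping the transposed frame,
    # which works without the level= / axis=1 options newer pandas dropped or deprecated.
    total = pivot.T.groupby(level=0).sum().T
    pct = pivot.xs(numerator, axis=1, level=1).div(total).round(2)

    # Give both results a second column level so they line up with the pivot.
    names = list(pivot.columns.names)
    total.columns = pd.MultiIndex.from_product([total.columns, ["total"]], names=names)
    pct.columns = pd.MultiIndex.from_product([pct.columns, ["pct"]], names=names)
    return pd.concat([pivot, total, pct], axis=1).sort_index(axis=1)

# The by-month table from the question.
columns = pd.MultiIndex.from_product([["2017-11", "2017-12"], ["A", "D"]],
                                     names=["month", "auth"])
df = pd.DataFrame([[10, 11, 11, 10], [19, 20, 10, 11], [50, 30, 50, 1]],
                  index=pd.Index(["mastercard", "amex", "visa"], name="card"),
                  columns=columns)

result = add_total_and_pct(df)
print(result)
```

Because sort_index orders the labels alphabetically, the columns within each month come out as A, D, pct, total; a different order needs a separate reindex step.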

[code block not preserved in the source]
