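I group a DataFrame by year (one level of the column MultiIndex), apply a function that pads each group out to 11 columns (adding as many empty columns as needed), and return the padded DataFrame. This raises an error: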
finalFormat = (penultimateFormatNot11Columns
               .groupby( level = 'Year', axis = 1 )
               .apply( padDFToXColumns )
              )
raise ValueError("cannot reindex from a duplicate axis")
Inside the applied padding function, the returned paddedDF has no duplicate labels on either axis.
Do you know where this error comes from?
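For reference, the "no duplicates on either axis" claim can be checked mechanically. This is only a minimal sketch: has_duplicate_labels, clean and dupes are hypothetical names invented for the illustration, not part of the original code.

```python
import pandas as pd

def has_duplicate_labels(df):
    """Return (rows_have_dupes, columns_have_dupes) for a DataFrame."""
    return (not df.index.is_unique, not df.columns.is_unique)

# Toy frame with unique labels on both axes.
clean = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Appending column "a" a second time makes the column axis non-unique.
dupes = pd.concat([clean, clean[["a"]]], axis=1)

print(has_duplicate_labels(clean))
print(has_duplicate_labels(dupes))
```

Running the same check on the object that groupby/apply assembles from all the returned pieces, rather than on each piece alone, may be more telling.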
The padding function
import numpy as np
import pandas as pd

def padDFToXColumns( df, TOT_COLUMNS = 11 ):
    """
    Pad out the number of columns in df to TOT_COLUMNS (add TOT_COLUMNS - len(df.columns) empty columns)
    """
    numColsInDF = len(df.columns)
    if numColsInDF > TOT_COLUMNS:
        print("ERROR: Number Of Columns (%s) Exceeds Max Columns (%s)" % (numColsInDF, TOT_COLUMNS))
        return
    ### Add Empty Columns ###
    numColsToAdd = TOT_COLUMNS - numColsInDF
    columnsToAdd = [ 'EmptyColumn' + str(num) for num in range(numColsInDF + 1, TOT_COLUMNS + 1) ]
    emptyColumns = pd.DataFrame( columns = columnsToAdd, index = np.arange(len(df.index)) )
    paddedDF = df.join(emptyColumns)
    #paddedDF.reset_index( drop = True, inplace = True )
    return paddedDF
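One possible source of duplicates, offered as a guess rather than a confirmed diagnosis: the filler names depend only on how many columns a group has, so two groups of the same width receive identical 'EmptyColumnN' labels, and those collide once the pieces are placed side by side. The sketch below assumes flat (non-MultiIndex) columns; pad_columns and the toy frames are hypothetical, and reindex is swapped in for the join used above.

```python
import pandas as pd

def pad_columns(df, total=11, prefix="EmptyColumn"):
    """Return df widened to `total` columns by appending all-NaN columns."""
    n = len(df.columns)
    if n > total:
        raise ValueError(f"{n} columns exceeds the maximum of {total}")
    filler = [f"{prefix}{i}" for i in range(n + 1, total + 1)]
    # reindex keeps the original row index, so no join alignment is involved
    return df.reindex(columns=list(df.columns) + filler)

# Two made-up groups of equal width, padded independently.
a = pad_columns(pd.DataFrame({"x": [1, 2]}), total=3)
b = pad_columns(pd.DataFrame({"y": [3, 4]}), total=3)

# Each piece is unique on its own, but the combined column axis is not.
side_by_side = pd.concat([a, b], axis=1)
print(a.columns.is_unique, b.columns.is_unique, side_by_side.columns.is_unique)
```

If this is what happens, passing a group-specific prefix would keep the combined labels unique.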
The DataFrame
>>> mydata.head()
   SurveyYear  Age        Race    Gender  WeightAdjusted
0        1996   39     1.White  1.Female         1039.13
1        1996    9     1.White    2.Male          995.13
2        1996    8     1.White    2.Male          775.66
3        1996   39     1.White    2.Male          404.28
4        1996   33  3.Hispanic  1.Female          404.28
>>> groupbyKeys = ['SurveyYear', 'Age', 'Race', 'Gender']
>>> cellPopulations = mydata.groupby(groupbyKeys).agg( {'WeightAdjusted':'sum'})
>>> cellPopulations.head(20)
                                    WeightAdjusted
SurveyYear Age Race       Gender
1996       0   1.White    1.Female      1204859.60
                          2.Male        1227666.34
               2.Black    1.Female       307495.16
                          2.Male         263571.07
               3.Hispanic 1.Female       320359.68
                          2.Male         392902.80
               4.Asian    1.Female        78615.49
                          2.Male          82341.54
               5.Other    1.Female        16134.33
                          2.Male          19365.76
           1   1.White    1.Female      1195134.70
                          2.Male        1195659.14
               2.Black    1.Female       328376.10
                          2.Male         383293.79
               3.Hispanic 1.Female       322862.58
                          2.Male         404322.04
               4.Asian    1.Female        79499.56
                          2.Male          73783.69
               5.Other    1.Female        20647.55
                          2.Male          24222.52
>>> unstackKey = ['SurveyYear', 'Age', 'Gender']
>>> penultimateFormatNot11Columns = cellPopulations.unstack(unstackKey)
>>> penultimateFormatNot11Columns
        WeightAdjusted                                                                                                              ...
SurveyYear        1996                                                                                                              ...        1997
Age                  0                       1                       2                       3                       4              ...          76                      77                      78                      79                      80
Gender        1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male  ...    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male    1.Female      2.Male
Race                                                                                                                                ...
1.White     1204859.60  1227666.34  1195134.70  1195659.14  1197386.21  1288700.89  1251324.65  1307458.14  1236790.33  1374989.75  ...   764103.31   506844.04   702775.64   425705.16   666705.33   423419.49   577674.82   366109.58  3898404.40  2283771.11
2.Black      307495.16   263571.07   328376.10   383293.79   291976.23   326400.85   310870.61   323344.13   301025.43   323199.08  ...    68272.99    43254.98    50082.98    34347.45    50788.70    36772.29    31393.21    20720.47   366569.11   180108.23
3.Hispanic   320359.68   392902.80   322862.58   404322.04   344564.20   340702.86   303325.95   321065.53   382663.64   311911.38  ...    39084.04    17362.56    27507.45    18803.48    17619.95    24060.91    35665.78    23802.81   174972.00   105530.84
4.Asian       78615.49    82341.54    79499.56    73783.69    96289.08    88222.32    96411.97    92029.56    77070.10    90370.15  ...    30196.58    27745.90    18419.49    15406.79     7272.27    17891.33    18116.50     3606.67    57684.54    42662.74
5.Other       16134.33    19365.76    20647.55    24222.52    17469.53    27237.94    11220.90     6996.58    23640.43    14917.77  ...     4441.26         nan     1487.90     2845.89      522.43     2453.52      303.66     2982.57    18870.12     6232.88
It looks to me like all you need is pivot_table. To do that, call df.reset_index(inplace=True) after the groupby(), then pivot the result.
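The exact call from the answer is not preserved here, so the following is only a sketch of what the suggestion might look like. It uses a made-up four-row stand-in for mydata with the column names from the question; the choice of Race as the row index and of sum as the aggregation are assumptions based on the unstacked layout shown above.

```python
import pandas as pd

# Hypothetical miniature of mydata, reusing the question's column names.
mydata = pd.DataFrame({
    "SurveyYear": [1996, 1996, 1996, 1996],
    "Age": [39, 39, 8, 8],
    "Race": ["1.White", "1.White", "1.White", "3.Hispanic"],
    "Gender": ["1.Female", "1.Female", "2.Male", "2.Male"],
    "WeightAdjusted": [1039.13, 404.28, 775.66, 995.13],
})

keys = ["SurveyYear", "Age", "Race", "Gender"]
cells = mydata.groupby(keys).agg({"WeightAdjusted": "sum"})
cells = cells.reset_index()  # turn the group keys back into ordinary columns

# One row per Race, one column per (SurveyYear, Age, Gender) combination.
wide = cells.pivot_table(
    index="Race",
    columns=["SurveyYear", "Age", "Gender"],
    values="WeightAdjusted",
    aggfunc="sum",
)
print(wide)
```

Combinations that never occur for a given Race come out as NaN, which is the same shape of result the unstack call in the question produces.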