我有下面的excel数据,我想转换成下面提到的预期平面格式。在预期的格式中,我不希望包含聚合列,例如差分和合计。请帮助我使用python熊猫数据框架或Pyspark数据框架完成它
Excel中的输入数据为:
预期最终格式为
df是输入数据上的数据帧。下面是
df = pd.read_excel("D:/StackOverflow.xlsx",sheet_name = 'Input')
Data Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 Type Domestic NaN NaN International NaN NaN
1 Unit/Subunit Jan Feb Diff Jan Feb Diff
2 NaN 2020 2021 NaN 2020 2021 NaN
3 Unit1 NaN NaN NaN NaN NaN NaN
4 SubUnit11 100 130 30 5000 8000 3000
5 SubUnit12 50 80 30 1000 4000 3000
6 SubUnit13 1000 1100 100 100000 100010 10
7 Total Unit1 1150 1310 160 106000 112010 6010
8 Unit2 NaN NaN NaN NaN NaN NaN
9 SubUnit21 1100 1130 30 15000 81100 66100
10 SubUnit22 150 180 30 11000 42000 31000
11 SubUnit23 11000 1110 -9890 1001000 1022010 21010
12 Total Unit2 12250 2420 -9830 1027000 1145110 118110
print (df.index.tolist()[:3])
[0, 1, 2]
print (df.columns.tolist()[:3])
['Data', 'Unnamed: 1', 'Unnamed: 2']
print (df.head(10).to_dict())
{'Data': {0: 'Type', 1: 'Unit/Subunit', 2: nan, 3: 'Unit1', 4: 'SubUnit11', 5: 'SubUnit12', 6: 'SubUnit13', 7: 'Total Unit1', 8: 'Unit2', 9: 'SubUnit21'}, 'Unnamed: 1': {0: 'Domestic ', 1: 'Jan', 2: 2020, 3: nan, 4: 100, 5: 50, 6: 1000, 7: 1150, 8: nan, 9: 1100}, 'Unnamed: 2': {0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 130, 5: 80, 6: 1100, 7: 1310, 8: nan, 9: 1130}, 'Unnamed: 3': {0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 30, 5: 30, 6: 100, 7: 160, 8: nan, 9: 30}, 'Unnamed: 4': {0: 'International', 1: 'Jan', 2: 2020, 3: nan, 4: 5000, 5: 1000, 6: 100000, 7: 106000, 8: nan, 9: 15000}, 'Unnamed: 5': {0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 8000, 5: 4000, 6: 100010, 7: 112010, 8: nan, 9: 81100}, 'Unnamed: 6': {0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 3000, 5: 3000, 6: 10, 7: 6010, 8: nan, 9: 66100}}
df1 = pd.read_excel("D:/StackOverflow.xlsx",sheet_name = 'Input',header = [1,2,3])
Type Domestic International
Unit/Subunit Jan Feb Diff Jan Feb Diff
Unnamed: 0_level_2 2020 2021 Unnamed: 3_level_2 2020 2021 Unnamed: 6_level_2
0 Unit1 NaN NaN NaN NaN NaN NaN
1 SubUnit11 100.0 130.0 30.0 5000.0 8000.0 3000.0
2 SubUnit12 50.0 80.0 30.0 1000.0 4000.0 3000.0
3 SubUnit13 1000.0 1100.0 100.0 100000.0 100010.0 10.0
4 Total Unit1 1150.0 1310.0 160.0 106000.0 112010.0 6010.0
5 Unit2 NaN NaN NaN NaN NaN NaN
6 SubUnit21 1100.0 1130.0 30.0 15000.0 81100.0 66100.0
7 SubUnit22 150.0 180.0 30.0 11000.0 42000.0 31000.0
8 SubUnit23 11000.0 1110.0 -9890.0 1001000.0 1022010.0 21010.0
9 Total Unit2 12250.0 2420.0 -9830.0 1027000.0 1145110.0 118110.0
您可以使用^{} 进行重塑,然后删除不必要的行:
最后是添加日期时间列进行排序:
相关问题 更多 >
编程相关推荐