DataFrame operations and merging in Python

Published 2024-10-03 23:30:25


I have a DataFrame, df1:

        inp   aco   drtn                
      2.3.6  dp   Less than 1 min         
      2.3.6  ft   5-10 min        
      2.5.9  dp   More than 1 hour        
      0.8.0  dp   1-5 min
      2.3.6  dp   10-30 min               
      2.3.6  dp   More than 1 hour        
      0.8.0  dp   1-5 min
      6.2.6  dp   1-5 min

df1 should be grouped by all three columns, counting how often each combination occurs. The new DataFrame, df2, would look like this:

       inp   aco   drtn                count
      2.3.6  dp   Less than 1 min         1
      2.3.6  ft   5-10 min                1
      2.5.9  dp   More than 1 hour        1
      0.8.0  dp   1-5 min                 2
      2.3.6  dp   10-30 min               1
      2.3.6  dp   More than 1 hour        1
      6.2.6  dp   1-5 min                 1

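The column "drtn" should be converted into a new column, "convrt". For example: Less than 1 min = 0.59 min, More than 1 hour = 61 min, 1-5 min = 5 min, 5-10 min = 10 min, 10-30 min = 30 min. Another new column, "calc", should be the value in "count" multiplied by the value in "convrt". The new DataFrame, df3, would then look like this: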

       inp   aco   drtn                 count   convrt   calc
      2.3.6  dp   Less than 1 min         1     0.59     0.59
      2.3.6  ft   5-10 min                1      10      10
      2.5.9  dp   More than 1 hour        1      61      61
      0.8.0  dp   1-5 min                 2       5      10
      2.3.6  dp   10-30 min               1      30      30
      2.3.6  dp   More than 1 hour        1      61      61
      6.2.6  dp   1-5 min                 1       5       5

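Next comes a new DataFrame, df4, filtered on the column "aco". For example, keep only the rows where the value is dp. A new column, pct, then gives each row's share of the "calc" column total, as a percentage.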

       inp   aco   drtn                 count   convrt   calc   pct
      2.3.6  dp   Less than 1 min         1     0.59     0.59   0.35
      2.5.9  dp   More than 1 hour        1      61      61     36.40
      0.8.0  dp   1-5 min                 2       5      10     5.97
      2.3.6  dp   10-30 min               1      30      30     17.90
      2.3.6  dp   More than 1 hour        1      61      61     36.40
      6.2.6  dp   1-5 min                 1       5       5     2.98

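Finally, a new DataFrame, df5, indexed by the filtered value of the column "aco". Its column "pctTime" sums all values in the "calc" column of df3 and of df4, divides the two sums (df4 / df3), and multiplies by 100. A column "totalCalcFilteredColumn" holds the sum of the "calc" column in the filtered DataFrame df4. Another column, "diff", sums all values in the "calc" column of df3 and of df4 and subtracts one from the other (df3 - df4).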

        pctTime      totalCalcFilteredColumn     diff    
  dp    94.37         167.59                     10        
  ft    5.63           10                       167.59    

How can I do this?


1 answer
User
#1 · posted 2024-10-03 23:30:25

Setup:

import pandas as pd
from io import StringIO

temp = """inp;aco;drtn
2.3.6;dp;Less than 1 min
2.3.6;ft;5-10 min
2.5.9;dp;More than 1 hour
0.8.0;dp;1-5 min
2.3.6;dp;10-30 min
2.3.6;dp;More than 1 hour
0.8.0;dp;1-5 min
6.2.6;dp;1-5 min"""
# after testing, replace StringIO(temp) with 'filename.csv'

df = pd.read_csv(StringIO(temp), sep=";")
print(df)
     inp aco              drtn
0  2.3.6  dp   Less than 1 min
1  2.3.6  ft          5-10 min
2  2.5.9  dp  More than 1 hour
3  0.8.0  dp           1-5 min
4  2.3.6  dp         10-30 min
5  2.3.6  dp  More than 1 hour
6  0.8.0  dp           1-5 min
7  6.2.6  dp           1-5 min

Solution:

# minutes assigned to each duration bucket
d = {'1-5 min': 5, '10-30 min': 30, '5-10 min': 10,
     'Less than 1 min': 0.59, 'More than 1 hour': 61}

# count each (inp, aco, drtn) combination, keeping first-seen order
df = df.groupby(['inp', 'aco', 'drtn'], sort=False).size().reset_index(name='count')
# map each duration bucket to minutes via the dictionary
df['convrt'] = df['drtn'].map(d)
df['calc'] = df['convrt'].mul(df['count'])
# divide by the per-group total: transform returns a Series the same size as the original df
df['pct'] = df['calc'].div(df.groupby('aco')['calc'].transform('sum')).mul(100)
print(df)
     inp aco              drtn  count  convrt   calc         pct
0  2.3.6  dp   Less than 1 min      1    0.59   0.59    0.352050
1  2.3.6  ft          5-10 min      1   10.00  10.00  100.000000
2  2.5.9  dp  More than 1 hour      1   61.00  61.00   36.398353
3  0.8.0  dp           1-5 min      2    5.00  10.00    5.966943
4  2.3.6  dp         10-30 min      1   30.00  30.00   17.900829
5  2.3.6  dp  More than 1 hour      1   61.00  61.00   36.398353
6  6.2.6  dp           1-5 min      1    5.00   5.00    2.983472
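
The question also asks for a separate frame, df4, filtered on "aco", which the transform-based step above does not build explicitly. A minimal sketch of that explicit filter follows, assuming the same sample rows; the names `rows`, `minutes`, `df3`, `df4` and `pct_all` are illustrative and do not come from the answer. It also checks that the filtered percentages agree with the groupby/transform result.

```python
import pandas as pd

# Sample rows from the setup above.
rows = [
    ("2.3.6", "dp", "Less than 1 min"),
    ("2.3.6", "ft", "5-10 min"),
    ("2.5.9", "dp", "More than 1 hour"),
    ("0.8.0", "dp", "1-5 min"),
    ("2.3.6", "dp", "10-30 min"),
    ("2.3.6", "dp", "More than 1 hour"),
    ("0.8.0", "dp", "1-5 min"),
    ("6.2.6", "dp", "1-5 min"),
]
minutes = {"1-5 min": 5, "10-30 min": 30, "5-10 min": 10,
           "Less than 1 min": 0.59, "More than 1 hour": 61}

# Grouped frame with count, convrt and calc (df3 in the question).
df3 = (pd.DataFrame(rows, columns=["inp", "aco", "drtn"])
         .groupby(["inp", "aco", "drtn"], sort=False)
         .size()
         .reset_index(name="count"))
df3["convrt"] = df3["drtn"].map(minutes)
df3["calc"] = df3["convrt"] * df3["count"]

# Explicit filter: keep only rows where aco == "dp" (df4 in the question).
df4 = df3[df3["aco"] == "dp"].copy()
df4["pct"] = df4["calc"] / df4["calc"].sum() * 100

# The same percentages computed on the unfiltered frame via transform.
pct_all = df3["calc"] / df3.groupby("aco")["calc"].transform("sum") * 100
print(df4)
```

Filtering first is easier to read when only one "aco" value matters; the transform version handles every "aco" value in a single pass.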

# aggregate the sum of calc per aco value
df = df.groupby('aco')['calc'].sum().reset_index(name='totalCalcFilteredColumn')
summed = df['totalCalcFilteredColumn'].sum()
df['pctTime'] = df['totalCalcFilteredColumn'].div(summed).mul(100)
# rsub subtracts from the right: summed - df['totalCalcFilteredColumn']
df['diff'] = df['totalCalcFilteredColumn'].rsub(summed)
print(df)
  aco  totalCalcFilteredColumn    pctTime    diff
0  dp                   167.59  94.369052   10.00
1  ft                    10.00   5.630948  167.59
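
The question shows df5 indexed by "aco" with the columns ordered pctTime, totalCalcFilteredColumn, diff. A small sketch of one way to get that layout is below, assuming the per-row "calc" values printed earlier; the names `df3`, `total`, `per_aco` and `df5` are illustrative only.

```python
import pandas as pd

# Per-row "calc" values copied from the grouped frame printed earlier.
df3 = pd.DataFrame({
    "aco":  ["dp", "ft", "dp", "dp", "dp", "dp", "dp"],
    "calc": [0.59, 10.0, 61.0, 10.0, 30.0, 61.0, 5.0],
})

total = df3["calc"].sum()                    # sum over all rows (df3)
per_aco = df3.groupby("aco")["calc"].sum()   # sum per filter value (df4)

# Building from Series that share the "aco" index keeps "aco" as the index.
df5 = pd.DataFrame({
    "pctTime": per_aco / total * 100,
    "totalCalcFilteredColumn": per_aco,
    "diff": total - per_aco,
})
print(df5.round(2))
```

Because `per_aco` is already indexed by "aco", no `reset_index` or `set_index` call is needed for this layout.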
