基于行小于任何列总和的百分比来过滤数据帧

2024-06-26 13:26:28 发布

您现在位置:Python中文网/ 问答频道 /正文

以下是我的示例数据:

{'Rhesus': {('count', u'augCGP,transMap'): 6.0, ('count', u'augTM,transMap'): 11563.0, ('count', u'transMap'): 39930.0, ('count', u'augTM'): 5114.0, ('count', u'augCGP,augTM,augTMR,transMap'): 27.0, ('count', u'augCGP,augTMR'): 1.0, ('count', u'augTMR,transMap'): 145.0, ('count', u'augTMR'): 4217.0, ('count', u'augCGP,augTMR,transMap'): nan, ('count', u'augCGP,augTM,augTMR'): nan, ('count', u'augCGP'): 4239.0, ('count', u'augCGP,augTM,transMap'): 3.0, ('count', u'augTM,augTMR,transMap'): 6296.0, ('count', u'augTM,augTMR'): 3357.0}, 'Susie': {('count', u'augCGP,transMap'): 11.0, ('count', u'augTM,transMap'): 10821.0, ('count', u'transMap'): 41300.0, ('count', u'augTM'): 2894.0, ('count', u'augCGP,augTM,augTMR,transMap'): 43.0, ('count', u'augCGP,augTMR'): nan, ('count', u'augTMR,transMap'): 353.0, ('count', u'augTMR'): 5399.0, ('count', u'augCGP,augTMR,transMap'): 1.0, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 2740.0, ('count', u'augCGP,augTM,transMap'): 2.0, ('count', u'augTM,augTMR,transMap'): 10196.0, ('count', u'augTM,augTMR'): 2789.0}, 'Clint': {('count', u'augCGP,transMap'): 16.0, ('count', u'augTM,transMap'): 17341.0, ('count', u'transMap'): 39284.0, ('count', u'augTM'): 2888.0, ('count', u'augCGP,augTM,augTMR,transMap'): 80.0, ('count', u'augCGP,augTMR'): 1.0, ('count', u'augTMR,transMap'): 144.0, ('count', u'augTMR'): 2881.0, ('count', u'augCGP,augTMR,transMap'): nan, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 2338.0, ('count', u'augCGP,augTM,transMap'): 8.0, ('count', u'augTM,augTMR,transMap'): 8725.0, ('count', u'augTM,augTMR'): 1441.0}, 'Orangutan': {('count', u'augCGP,transMap'): 7.0, ('count', u'augTM,transMap'): 6568.0, ('count', u'transMap'): 46113.0, ('count', u'augTM'): 3656.0, ('count', u'augCGP,augTM,augTMR,transMap'): 17.0, ('count', u'augCGP,augTMR'): nan, ('count', u'augTMR,transMap'): 284.0, ('count', u'augTMR'): 5952.0, ('count', u'augCGP,augTMR,transMap'): 1.0, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 5753.0, ('count', u'augCGP,augTM,transMap'): 3.0, ('count', u'augTM,augTMR,transMap'): 6567.0, ('count', u'augTM,augTMR'): 3520.0}, 'Gibbon': {('count', u'augCGP,transMap'): 5.0, ('count', u'augTM,transMap'): 6828.0, ('count', u'transMap'): 44285.0, ('count', u'augTM'): 4313.0, ('count', u'augCGP,augTM,augTMR,transMap'): 16.0, ('count', u'augCGP,augTMR'): nan, ('count', u'augTMR,transMap'): 187.0, ('count', u'augTMR'): 6550.0, ('count', u'augCGP,augTMR,transMap'): nan, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 4178.0, ('count', u'augCGP,augTM,transMap'): nan, ('count', u'augTM,augTMR,transMap'): 5839.0, ('count', u'augTM,augTMR'): 3882.0}}

这是一个数据帧,看起来像:

>>> df
genome                                Clint   Gibbon  Orangutan   Rhesus  \
      Transcript Modes
count augCGP                         2338.0   4178.0     5753.0   4239.0
      augCGP,augTM,augTMR               1.0      1.0        1.0      NaN
      augCGP,augTM,augTMR,transMap     80.0     16.0       17.0     27.0
      augCGP,augTM,transMap             8.0      NaN        3.0      3.0
      augCGP,augTMR                     1.0      NaN        NaN      1.0
      augCGP,augTMR,transMap            NaN      NaN        1.0      NaN
      augCGP,transMap                  16.0      5.0        7.0      6.0
      augTM                          2888.0   4313.0     3656.0   5114.0
      augTM,augTMR                   1441.0   3882.0     3520.0   3357.0
      augTM,augTMR,transMap          8725.0   5839.0     6567.0   6296.0
      augTM,transMap                17341.0   6828.0     6568.0  11563.0
      augTMR                         2881.0   6550.0     5952.0   4217.0
      augTMR,transMap                 144.0    187.0      284.0    145.0
      transMap                      39284.0  44285.0    46113.0  39930.0

genome                                Susie
      Transcript Modes
count augCGP                         2740.0
      augCGP,augTM,augTMR               1.0
      augCGP,augTM,augTMR,transMap     43.0
      augCGP,augTM,transMap             2.0
      augCGP,augTMR                     NaN
      augCGP,augTMR,transMap            1.0
      augCGP,transMap                  11.0
      augTM                          2894.0
      augTM,augTMR                   2789.0
      augTM,augTMR,transMap         10196.0
      augTM,transMap                10821.0
      augTMR                         5399.0
      augTMR,transMap                 353.0
      transMap                      41300.0

如您所见,其中一些类别的条目非常少。我想对每一行(Transcript Modes)进行过滤,这样如果它们在每一列的总数中所占的比例小于1%,就会删除它们。所以,我得到的数据帧看起来像:

>>> df
genome                                Clint   Gibbon  Orangutan   Rhesus  \
      Transcript Modes
count augCGP                         2338.0   4178.0     5753.0   4239.0
      augTM                          2888.0   4313.0     3656.0   5114.0
      augTM,augTMR                   1441.0   3882.0     3520.0   3357.0
      augTM,augTMR,transMap          8725.0   5839.0     6567.0   6296.0
      augTM,transMap                17341.0   6828.0     6568.0  11563.0
      augTMR                         2881.0   6550.0     5952.0   4217.0
      transMap                      39284.0  44285.0    46113.0  39930.0

genome                                Susie
      Transcript Modes
count augCGP                         2740.0
      augTM                          2894.0
      augTM,augTMR                   2789.0
      augTM,augTMR,transMap         10196.0
      augTM,transMap                10821.0
      augTMR                         5399.0
      transMap                      41300.0

Tags: 数据genomecountnantranscriptclintmodessusie
2条回答
row_sum = df.sum(axis=1)
total_sum = row_sum.sum()
print(df.loc[row_sum/total_sum > 0.01])

收益率

                               Clint   Gibbon  Orangutan   Rhesus    Susie
count augCGP                  2338.0   4178.0     5753.0   4239.0   2740.0
      augTM                   2888.0   4313.0     3656.0   5114.0   2894.0
      augTM,augTMR            1441.0   3882.0     3520.0   3357.0   2789.0
      augTM,augTMR,transMap   8725.0   5839.0     6567.0   6296.0  10196.0
      augTM,transMap         17341.0   6828.0     6568.0  11563.0  10821.0
      augTMR                  2881.0   6550.0     5952.0   4217.0   5399.0
      transMap               39284.0  44285.0    46113.0  39930.0  41300.0

相关问题 更多 >