如果子数据帧包含超过40%的数据,如何删除它

2024-09-28 01:24:46 发布

您现在位置:Python中文网/ 问答频道 /正文

大家好,我有这样的问题:

我有400.000个对象的面板数据,如果其中包含超过40%的NAN,我想删除对象

例如:

               inn  time_reg   revenue1    balans1    equity1  opprofit1  \
0        0101000021      2006   457000.0   115000.0    28000.0    29000.0   
1        0101000021      2007  1943000.0   186000.0   104000.0    99000.0   
2        0101000021      2008  2812000.0   318000.0   223000.0   127000.0   
3        0101000021      2009  2673000.0   370000.0   242000.0    39000.0   
4        0101000021      2010  3240000.0   435000.0    45000.0        NaN   
...             ...       ...        ...        ...        ...        ...   
4081810  9909403758      2003  6943000.0  2185000.0  2136000.0   -97000.0   
4081811  9909403758      2004  6504000.0  2245000.0  2196000.0   -34000.0   
4081812  9909403758      2005        NaN        NaN        NaN        NaN   
4081813  9909403758      2006        NaN        NaN        NaN        NaN   
4081814  9909403758      2007        NaN        NaN        NaN        NaN   

        grossprofit1  netprofit1  currentassets1  stliabilities1  
0             92000.0     18000.0        105000.0         87000.0  
1            189000.0     76000.0        176000.0         82000.0  
2            472000.0    119000.0        308000.0         95000.0  
3            483000.0     29000.0        360000.0        128000.0  
4                 NaN     35000.0             NaN             NaN  
...               ...         ...             ...             ...  
4081810     2365000.0    -59000.0        253000.0         49000.0  
4081811     2278000.0     60000.0        425000.0         49000.0  
4081812           NaN         NaN             NaN             NaN  
4081813           NaN         NaN             NaN             NaN  
4081814           NaN         NaN             NaN             NaN  

我有这样的数据框,对于按(inn,time_reg)分组的每个子数据框,如果列中的总NAN(收入1平衡1权益1机会1总利润1净利润1当前资产1负债1)超过40%,我需要删除它

我有一个循环的想法,但这需要很多时间 例如:

                inn  time_reg   revenue1    balans1    equity1  opprofit1  \
4081809  9909403758      2002  6078000.0  2270000.0  2195000.0   -32000.0   
4081810  9909403758      2003  6943000.0  2185000.0  2136000.0   -97000.0   
4081811  9909403758      2004  6504000.0  2245000.0  2196000.0   -34000.0   
4081812  9909403758      2005        NaN        NaN        NaN        NaN   
4081813  9909403758      2006        NaN        NaN        NaN        NaN   
4081814  9909403758      2007        NaN        NaN        NaN        NaN   

         grossprofit1  netprofit1  currentassets1  stliabilities1  
4081809     1324000.0         NaN        234000.0         75000.0  
4081810     2365000.0    -59000.0        253000.0         49000.0  
4081811     2278000.0     60000.0        425000.0         49000.0  
4081812           NaN         NaN             NaN             NaN  
4081813           NaN         NaN             NaN             NaN  
4081814           NaN         NaN             NaN             NaN  

这个子数据帧应该删除,因为它包含超过40%的NAN

      inn  time_reg    revenue1    balans1   equity1  opprofit1  \
0   0101000021      2006    457000.0   115000.0   28000.0    29000.0   
1   0101000021      2007   1943000.0   186000.0  104000.0    99000.0   
2   0101000021      2008   2812000.0   318000.0  223000.0   127000.0   
3   0101000021      2009   2673000.0   370000.0  242000.0    39000.0   
4   0101000021      2010   3240000.0   435000.0   45000.0        NaN   
5   0101000021      2011   3480000.0   610000.0   71000.0        NaN   
6   0101000021      2012   4820000.0   710000.0  139000.0   149000.0   
7   0101000021      2013   5200000.0   790000.0  148000.0   170000.0   
8   0101000021      2014   5450000.0   830000.0  155000.0   180000.0   
9   0101000021      2015   5620000.0   860000.0  164000.0   189000.0   
10  0101000021      2016   5860000.0   885000.0  175000.0   200000.0   
11  0101000021      2017  15112000.0  1275000.0  298000.0   323000.0   

    grossprofit1  netprofit1  currentassets1  stliabilities1  
0        92000.0     18000.0        105000.0         87000.0  
1       189000.0     76000.0        176000.0         82000.0  
2       472000.0    119000.0        308000.0         95000.0  
3       483000.0     29000.0        360000.0        128000.0  
4            NaN     35000.0             NaN             NaN  
5            NaN     61000.0             NaN             NaN  
6       869000.0    129000.0        700000.0        571000.0  
7      1040000.0    138000.0        780000.0        642000.0  
8      1090000.0    145000.0        820000.0        675000.0  
9      1124000.0    154000.0        850000.0        696000.0  
10     1172000.0    165000.0        875000.0        710000.0  
11     3023000.0    288000.0       1265000.0        977000.0  

此子数据帧包含的NAN少于40%,并且必须位于最终数据帧中


Tags: 数据对象timenanreginnequity1netprofit1
2条回答

您可以使用pd.DataFrame.groupbyfilter方法。 这允许您传递一个函数,该函数指示是否应过滤子帧(在这种情况下,如果相关列中包含超过40%的NAN)。要获取该信息,可以使用numpy将NAN计数为getNanFraction

def getNanFraction(df):
    nanCount = np.sum(np.isnan(df.drop("inn", axis=1).values))
    return nanCount/len(df)


df.groupby("inn").filter(lambda x: getNanFraction(x) < 0.4 )

如果使用numpy/pandas函数进行计数,循环是否也会太慢?您可以使用someDataFrame.isnull().sum().sum()

可能比编写自己的循环来遍历数据帧中的所有值要快得多,因为这些库往往具有这些类型函数的非常高效的实现

相关问题 更多 >

    热门问题