Group rows based on two columns and create a third column; find groups smaller than x and merge them with other groups

    id  herd  birth       H_BY    HYcount  death       H_DY    HYcount2
    1   1345  2005-01-09  134505  1        2010-01-09  134510  1
    2   1345  2010-03-05  134510  2        2015-01-09  134515  2
    3   1345  2010-05-10  134510  2        2015-01-09  134515  2
    4   1345  2011-06-01  134511  1        2016-01-09  134516  1
    5   1345  2012-09-01  134512  1        2017-01-09  134517  2
    6   1345  2015-09-13  134515  1        2017-01-09  134517  2
    7   1346  2015-10-01  134615  3        2019-01-09  134619  1
    8   1346  2015-10-27  134615  3        2020-01-09  134620  2
    9   1346  2015-11-10  134615  3        2020-01-09  134620  2
    10  1346  2016-12-10  134616  1        2021-01-09  134621  1

    # Sort the df by the relevant value
    df = df.sort_values(by=['H_BY'])
    df.loc[(df['HYcount'] < 3), 'H_BY'] = df['H_BY'].shift(-1)
    # Count the values again
    df['HC1_c'] = df.groupby('H_BY')['H_BY'].transform('count')

    id  herd  birth       H_BY    HYcount  death       H_DY    HYcount2
    1   1345  2005-01-09  134510  3        2010-01-09  134515  3
    2   1345  2010-03-05  134510  3        2015-01-09  134515  3
    3   1345  2010-05-10  134510  3        2015-01-09  134515  3
    4   1345  2011-06-01  134515  3        2016-01-09  134517  3
    5   1345  2012-09-01  134515  3        2017-01-09  134517  3
    6   1345  2015-09-13  134515  3        2017-01-09  134517  3
    7   1346  2015-10-01  134615  4        2019-01-09  134620  4
    8   1346  2015-10-27  134615  4        2020-01-09  134620  4
    9   1346  2015-11-10  134615  4        2020-01-09  134620  4
    10  1346  2016-12-10  134615  4        2021-01-09  134620  4

1 answer

User
#1 · Posted 2024-10-02 22:36:05

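To solve this, I dropped the H_BY and H_DY columns so that the groups can be counted dynamically. One problem with storing the counts in the DataFrame is that, as noted, they have to be recomputed every time the grouping changes, and the same count is repeated on every row of a group.
Then I converted birth and death to datetimes so that new columns by and dy could be created for the birth year and the death year.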
    import pandas as pd

    ff = df[['herd', 'birth', 'death']].copy()
    ff['birth'] = pd.to_datetime(ff['birth'])
    ff['death'] = pd.to_datetime(ff['death'])
    # extract the birth year (by) and the death year (dy)
    ff = ff.assign(by=ff['birth'].dt.year, dy=ff['death'].dt.year)
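For the actual processing, we first group by herd so that herds are never mixed with one another. Within each herd, an undersized group is merged forward into the next group when there is one, otherwise backward into the previous group, and this repeats until no more merges happen. Finally, the groups are assigned back to the original rows.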
    tdf = []
    for herd, data in ff.groupby('herd'):
        data = data.copy()  # work on a copy so the assignment below is safe
        # get counts per birth year and assign initial groups;
        # the count column is named explicitly because its default
        # name differs between pandas versions
        counts = data['by'].value_counts().sort_index().to_frame(name='n')
        counts['group'] = range(counts.shape[0])
        while True:
            gcounts = counts.groupby('group')['n'].sum()  # group counts
            change = gcounts[gcounts.values < 3]  # groups with too few
            if change.shape[0] == 0:  # no changes, exit
                break
            # check how to merge groups
            cgroup = change.index.min()
            groups = gcounts.index.values
            g_ind = list(groups).index(cgroup)
            if (g_ind + 1) < groups.shape[0]:  # merge forward
                ngroup = groups[g_ind + 1]
            elif g_ind > 0:  # merge backward
                ngroup = groups[g_ind - 1]
            else:  # no groups to merge
                print(f'Cannot merge herd {herd}')
                break
            counts.loc[counts['group'] == cgroup, 'group'] = ngroup
        # assign groups back to the rows of this herd
        for ind, gdata in counts.iterrows():
            data.loc[data['by'] == ind, 'group'] = gdata['group']
        tdf.append(data)
    tdf = pd.concat(tdf)
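To make the merge rule easier to follow, here is a minimal pure-Python sketch of the same idea applied to the birth-year counts of herd 1345 from the question. The function name `merge_small_groups` and the `min_size` parameter are hypothetical and only serve this illustration; the input is assumed to be a list of `(year, count)` pairs already sorted by year.

```python
def merge_small_groups(counts, min_size=3):
    """Map each key to a group label so that groups reach min_size where possible.

    counts: list of (key, n) pairs, sorted by key (an assumption of this sketch).
    """
    keys = [k for k, _ in counts]
    sizes = dict(counts)
    group = {k: i for i, k in enumerate(keys)}  # one group per key to start
    while True:
        # total number of rows currently in each group
        totals = {}
        for k in keys:
            totals[group[k]] = totals.get(group[k], 0) + sizes[k]
        labels = sorted(totals)
        small = [g for g in labels if totals[g] < min_size]
        if not small:
            break  # every group is large enough
        cur = small[0]  # lowest-labelled undersized group
        pos = labels.index(cur)
        if pos + 1 < len(labels):
            target = labels[pos + 1]  # merge forward
        elif pos > 0:
            target = labels[pos - 1]  # merge backward
        else:
            break  # a single undersized group, nothing to merge with
        for k in keys:
            if group[k] == cur:
                group[k] = target
    return group


herd_1345 = [(2005, 1), (2010, 2), (2011, 1), (2012, 1), (2015, 1)]
result = merge_small_groups(herd_1345)
print(result)  # {2005: 1, 2010: 1, 2011: 4, 2012: 4, 2015: 4}
```

Birth years 2005 and 2010 end up in one group of three animals, and 2011, 2012 and 2015 in another group of three, which matches the grouping in the desired output of the question.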
Finally, if you still need an H_BY identifier for the grouping, you can use:
    tdf['H_BY'] = tdf['herd'].astype(str) + tdf['group'].astype(int).astype(str)
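If a per-row count like the HYcount column from the question is still wanted, it can be derived from the identifier on demand instead of being stored and maintained. The sketch below uses a small made-up stand-in for `tdf` (the real frame would come from the loop above); the values in it are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of tdf after the group assignment
tdf = pd.DataFrame({
    'herd':  [1345, 1345, 1345, 1346, 1346, 1346, 1346],
    'group': [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})

# Build the identifier from herd and group
tdf['H_BY'] = tdf['herd'].astype(str) + tdf['group'].astype(int).astype(str)

# Recount the rows in each merged group
tdf['HYcount'] = tdf.groupby('H_BY')['H_BY'].transform('count')
print(tdf)
```

Because the count is computed from the final grouping, it does not need to be updated while groups are being merged.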
