Counting the values of a numeric column into new columns in a DataFrame

Published 2024-09-29 21:49:31

I have a pandas DataFrame like this:

  Movie Rate
0 5821  4
1 2124  2
2 7582  1
3 3029  5
4 17479 1

Both movies and ratings can repeat. I need to transform this DataFrame into something like the following:

  Movie Rate_1_Count Rate_2_Count ... Rate_5_Count
0 5821  20           1                5
1 2124  2            0                99
2 7582  50           22               22
...

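Each movie ID appears once, and Rate_{Number}_Count is the number of ratings for that movie that equal {Number}. I have already done this with the code below, but I think it is very messy. There must be a tidier way. Can anyone help?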

self.movie_df_tmp = self.rating_df[['MovieId', 'Rate']]

self.movie_df_tmp['RaCount'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('count')
self.movie_df_tmp['Sum'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('sum')
self.movie_df_tmp['NORC'] = self.movie_df_tmp.groupby(['MovieId', 'Rate'])['Rate'].transform('count')

self.movie_df_tmp = self.movie_df_tmp.drop_duplicates()
self.movie_df_tmp['Rate1C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 1]['NORC']
self.movie_df_tmp['Rate2C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 2]['NORC']
self.movie_df_tmp['Rate3C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 3]['NORC']
self.movie_df_tmp['Rate4C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 4]['NORC']
self.movie_df_tmp['Rate5C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 5]['NORC']
self.movie_df_tmp = self.movie_df_tmp.replace(np.nan, 0)
self.movie_df = self.movie_df_tmp[['MovieId', 'RaCount', 'Sum']].drop_duplicates()
self.movie_df_tmp = self.movie_df_tmp.drop(columns=['Rate', 'NORC', 'Sum', 'RaCount'])
self.movie_df_tmp = self.movie_df_tmp.groupby(['MovieId'])[["Rate1C", "Rate2C", "Rate3C", "Rate4C", "Rate5C"]].apply(
    lambda x: x.astype(int).sum())
self.movie_df = self.movie_df.merge(self.movie_df_tmp, left_on='MovieId', right_on='MovieId')
# The merged frame has eight columns, so all eight names are listed here
self.movie_df = pd.DataFrame(self.movie_df.values,
                             columns=['MovieId', 'RaCount', 'Sum', 'Rate1C', 'Rate2C', 'Rate3C', 'Rate4C',
                                      'Rate5C'])

2 Answers

Try pd.crosstab:

pd.crosstab(df['Movie'], df['Rate'])
Rate   1  2  4  5
Movie            
2124   0  1  0  0
3029   0  0  0  1
5821   0  0  1  0
7582   1  0  0  0
17479  1  0  0  0

To fix the axis name and the column names, chain rename + reset_index + rename_axis:

new_df = (
    pd.crosstab(df['Movie'], df['Rate'])
        .rename(columns=lambda c: f'Rate_{c}_Count')
        .reset_index()
        .rename_axis(columns=None)
)
   Movie  Rate_1_Count  Rate_2_Count  Rate_4_Count  Rate_5_Count
0   2124             0             1             0             0
1   3029             0             0             0             1
2   5821             0             0             1             0
3   7582             1             0             0             0
4  17479             1             0             0             0
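
Note that the output above has no Rate_3_Count column, because crosstab only creates columns for values that actually occur in the data. If all five columns are required, one option is to reindex the columns before renaming them. The following is a minimal sketch, assuming the possible ratings are the integers 1 to 5 (the question implies this but does not state it); `all_rates` is a name introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Movie": [5821, 2124, 7582, 3029, 17479],
    "Rate": [4, 2, 1, 5, 1],
})

# Assumption: ratings are the integers 1..5
all_rates = range(1, 6)

new_df = (
    pd.crosstab(df["Movie"], df["Rate"])
      # Add a zero-filled column for every rating missing from the data
      .reindex(columns=all_rates, fill_value=0)
      .rename(columns=lambda c: f"Rate_{c}_Count")
      .reset_index()
      .rename_axis(columns=None)
)
print(new_df)
```

With the sample data, Rate_3_Count is present and contains only zeros.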

This will give you the desired output:

grouper = df.groupby(['Movie', 'Rate']).size()
dg = pd.DataFrame()
dg['Movie'] = df['Movie'].unique()
for i in [1, 2, 3, 4, 5]:
    # Look up the (movie, rating) count; use 0 when that pair never occurs
    dg[f'Rate_{i}_Count'] = dg['Movie'].apply(
        lambda x: grouper[x, i] if (x, i) in grouper.index else 0)
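
The same grouped sizes can probably be reshaped without an explicit loop by unstacking the rating level. The sketch below is one possible variant, not either answerer's method. It assumes ratings are the integers 1 to 5 and also adds the per-movie rating count and sum, which the question's code stores as RaCount and Sum. The names `counts`, `totals` and `movie_df` are chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Movie": [5821, 2124, 7582, 3029, 17479],
    "Rate": [4, 2, 1, 5, 1],
})

# One row per movie, one column per rating value; absent pairs become 0
counts = (
    df.groupby(["Movie", "Rate"]).size()
      .unstack(fill_value=0)
      .reindex(columns=range(1, 6), fill_value=0)  # assumption: ratings are 1..5
      .rename(columns=lambda c: f"Rate_{c}_Count")
)

# Per-movie number of ratings and sum of ratings
totals = df.groupby("Movie")["Rate"].agg(RaCount="count", Sum="sum")

# Both frames are indexed by Movie, so an index join lines them up
movie_df = totals.join(counts).reset_index().rename_axis(columns=None)
print(movie_df)
```

Because groupby sorts its keys by default, the rows come out ordered by movie ID rather than in order of first appearance, unlike the loop version, which uses unique().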
