Find the mean per group and display all the information

df1 = pd.DataFrame({'userId': [1,1,1,2,2,3,4,4],
                    'movieId': [500,600,700,1100,1200,600,600,1900],
                    'ratings': [3.5,4.5,2.0,5.0,4.0,4.5,5.0,3.5]})

df2 = pd.DataFrame({'userId':[1,1,2,3,4,5],
                    'movieId':[500,600,1100,800,900,600],
                    'tag':['Highly quotable','Boxing story','MMA','Tom Hardy','Fun','long movie']})

frames = [df1, df2]
result = pd.concat(frames, sort = False)
result

   userId  movieId  ratings              tag
0       1      500      3.5              NaN
1       1      600      4.5              NaN
2       1      700      2.0              NaN
3       2     1100      5.0              NaN
4       2     1200      4.0              NaN
5       3      600      4.5              NaN
6       4      600      5.0              NaN
7       4     1900      3.5              NaN
0       1      500      NaN  Highly quotable
1       1      600      NaN     Boxing story
2       2     1100      NaN              MMA
3       3      800      NaN        Tom Hardy
4       4      900      NaN              Fun
5       5      600      NaN       long movie

1 Answer

User

#1 · Posted 2024-09-28 19:22:49

I would suggest a different approach. Instead of concat, I would use pd.merge.
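To illustrate the difference, here is a minimal sketch on two tiny made-up frames (the names `ratings`, `tags`, `stacked` and `merged` are illustrative only, not from the question). concat stacks the rows of both frames, so every row is missing either the rating or the tag, while merge aligns rows on the shared key.

```python
import pandas as pd

# Tiny illustrative frames (hypothetical data, smaller than in the question)
ratings = pd.DataFrame({'userId': [1, 2], 'movieId': [500, 600], 'ratings': [3.5, 4.5]})
tags = pd.DataFrame({'movieId': [500, 600], 'tag': ['Highly quotable', 'Boxing story']})

# concat stacks the rows; columns missing from one frame are filled with NaN
stacked = pd.concat([ratings, tags], sort=False)

# merge aligns rows on movieId, so rating and tag end up on the same row
merged = ratings.merge(tags, on='movieId', how='left')

print(stacked.shape)  # (4, 4)
print(merged.shape)   # (2, 4)
```

In this small case the merged frame has no NaN at all, whereas the stacked frame has two rows without a rating and two rows without a tag.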

Take a look at this:

import pandas as pd

df1 = pd.DataFrame({'userId': [1,1,1,2,2,3,4,4],
                   'movieId': [500,600,700,1100,1200,600,600,1900],
                   'ratings': [3.5,4.5,2.0,5.0,4.0,4.5,5.0,3.5]})


df2 = pd.DataFrame({'userId':[1,1,2,3,4,5],
                    'movieId':[500,600,1100,800,900,600],
                    'tag':['Highly quotable','Boxing story','MMA','Tom Hardy','Fun','long movie']})

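# Merge the tags into df1; unlike concat, this does not add a block of rows with NaN ratings
# (movies without a tag still get a NaN tag, and groupby drops those rows by default)
result = df1.merge(df2[['movieId', 'tag']], on='movieId', how='left')

# Group by movie and tag, computing two aggregates (count and mean) with agg
result.groupby(by=['movieId', 'tag'], as_index=False).agg({'ratings': ['count', 'mean']})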

The output will be:

  movieId              tag ratings          
                             count      mean
0     500  Highly quotable       1  3.500000
1     600     Boxing story       3  4.666667
2     600       long movie       3  4.666667
3    1100              MMA       1  5.000000
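Note that movie 600 appears once per tag in the output above, each time with a count of 3. Merging on movieId alone pairs every rating of a movie with every tag of that movie. If the tags are meant to belong to one specific user's rating (an assumption; the question does not say so), a sketch of the alternative is to merge on both keys. The names `by_movie` and `by_user_movie` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3, 4, 4],
                    'movieId': [500, 600, 700, 1100, 1200, 600, 600, 1900],
                    'ratings': [3.5, 4.5, 2.0, 5.0, 4.0, 4.5, 5.0, 3.5]})

df2 = pd.DataFrame({'userId': [1, 1, 2, 3, 4, 5],
                    'movieId': [500, 600, 1100, 800, 900, 600],
                    'tag': ['Highly quotable', 'Boxing story', 'MMA',
                            'Tom Hardy', 'Fun', 'long movie']})

# Merge on movieId only: each of the 3 ratings of movie 600 is paired with both of its tags
by_movie = df1.merge(df2[['movieId', 'tag']], on='movieId', how='left')

# Merge on (userId, movieId): a tag is attached only to the rating by the same user
by_user_movie = df1.merge(df2, on=['userId', 'movieId'], how='left')

print(len(df1), len(by_movie), len(by_user_movie))  # 8 11 8
```

Which variant is right depends on whether a tag describes the movie in general or one user's opinion of it.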

Hope this helps.

Edit

As you asked in the comments, if you want to filter the DataFrame, just run the code below:

# Store the grouped output, then flatten its MultiIndex columns (just to make it easier for you)
stats = result.groupby(by=['movieId', 'tag'], as_index=False).agg({'ratings': ['count', 'mean']})
stats.columns = ['movieId', 'tag', 'ratings_count', 'ratings_mean']

# Filtering
stats = stats[stats['ratings_count'] >= 2]
stats = stats[stats['ratings_mean'] >= 3]

There are better ways to do this, but I assumed you are not yet familiar with pandas MultiIndex, so I kept the solution simple.
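One such alternative, sketched here under the assumption that pandas 0.25 or later is available, is named aggregation: passing `name=(column, function)` pairs to `agg` yields flat column names directly, so no MultiIndex has to be flattened afterwards. The variable names `merged`, `summary` and `filtered` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3, 4, 4],
                    'movieId': [500, 600, 700, 1100, 1200, 600, 600, 1900],
                    'ratings': [3.5, 4.5, 2.0, 5.0, 4.0, 4.5, 5.0, 3.5]})

df2 = pd.DataFrame({'userId': [1, 1, 2, 3, 4, 5],
                    'movieId': [500, 600, 1100, 800, 900, 600],
                    'tag': ['Highly quotable', 'Boxing story', 'MMA',
                            'Tom Hardy', 'Fun', 'long movie']})

# Same merge as in the answer
merged = df1.merge(df2[['movieId', 'tag']], on='movieId', how='left')

# Named aggregation: each keyword becomes a flat output column
summary = (
    merged.groupby(['movieId', 'tag'], as_index=False)
          .agg(ratings_count=('ratings', 'count'),
               ratings_mean=('ratings', 'mean'))
)

# Apply both filters in one boolean mask
filtered = summary[(summary['ratings_count'] >= 2) & (summary['ratings_mean'] >= 3)]
print(filtered)
```

With the sample data, only the two rows for movie 600 remain after filtering.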
