Pandas problem: merging correctly when loading into a df, then averaging iteratively?

Posted 2024-09-30 06:33:09


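I'm having trouble with a dataset I'm merging. I have two sets that I need to combine on a specific key identifier called "msno". Not every value is present, and the same person can appear multiple times in the dataset.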

Code example

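import pandas

# Column names for the user logs file (matching the columns in the output)
colnamesa = ['msno', 'date', 'num_25', 'num_50', 'num_75', 'num_985',
             'num_100', 'num_unq', 'total_secs']

# Column names for the members file
colnamesb = ['msno', 'city', 'bd', 'gender',
             'registered_via', 'registration_init_time']

# skiprows=[0] skips each file's own header row, since names are passed explicitly
a = pandas.read_csv('userlogs.csv', names=colnamesa, skiprows=[0])
b = pandas.read_csv('members.csv', names=colnamesb, skiprows=[0])

# Outer merge keeps users that appear in only one of the two files
c = a.merge(b, how='outer', on='msno')
# Keep only rows that have at least 4 non-NaN values
df = c.dropna(thresh=4)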

Output

    msno        date  num_25  num_50  num_75  num_985  num_100  num_unq  total_secs  city    bd gender  registered_via  registration_init_time
0     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170331.0     8.0     4.0     0.0      1.0     21.0     18.0    6309.273   1.0   0.0    NaN             7.0              20161220.0
1     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170316.0     6.0     4.0     1.0      3.0     26.0     31.0    7926.107   1.0   0.0    NaN             7.0              20161220.0
2     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170325.0     6.0     4.0     2.0      1.0     65.0     58.0   17148.343   1.0   0.0    NaN             7.0              20161220.0
3     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170310.0    10.0     2.0     1.0      5.0     35.0     39.0   10519.150   1.0   0.0    NaN             7.0              20161220.0
4     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170328.0   101.0     1.0     3.0      6.0     34.0     80.0   11046.850   1.0   0.0    NaN             7.0              20161220.0
5     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170307.0    13.0     2.0     3.0      2.0     45.0     55.0   12581.496   1.0   0.0    NaN             7.0              20161220.0
6     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170321.0    13.0     3.0     2.0      1.0     41.0     31.0   11806.946   1.0   0.0    NaN             7.0              20161220.0
7     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170315.0    14.0     7.0     3.0     11.0     24.0     41.0   10153.821   1.0   0.0    NaN             7.0              20161220.0
8     u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=  20170330.0     0.0     0.0     1.0      0.0     24.0      2.0    5773.754   1.0   0.0    NaN             7.0              20161220.0

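Desired output: for all entries that share the same msno (they are the same person), I want to average the values of num_25, ..., total_secs, but not date. Is this possible?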
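One possible way to get this is a groupby on msno with a per-column aggregation. The following is only a minimal sketch, not a confirmed answer: the toy frame stands in for the merged df, its values and the user IDs "userA"/"userB" are made up, and the names score_cols, member_cols and agg_spec are hypothetical helpers. It assumes that the member columns (city here) carry the same value on every row of a user, so taking the first value loses nothing.

```python
import pandas as pd

# Toy stand-in for the merged frame `df`; all values are made up for illustration.
df = pd.DataFrame({
    "msno": ["userA", "userA", "userB"],
    "date": [20170331, 20170316, 20170325],
    "num_25": [8.0, 6.0, 10.0],
    "num_50": [4.0, 4.0, 2.0],
    "total_secs": [6000.0, 8000.0, 10000.0],
    "city": [1.0, 1.0, 5.0],
})

# Columns to average per user; `date` is deliberately left out.
score_cols = ["num_25", "num_50", "total_secs"]

# Per-user attributes assumed to repeat on every row; keep the first value.
member_cols = ["city"]

# Build a column -> aggregation mapping: mean for scores, first for attributes.
agg_spec = {col: "mean" for col in score_cols}
agg_spec.update({col: "first" for col in member_cols})

# One row per msno; columns not listed in agg_spec (here `date`) are dropped.
per_user = df.groupby("msno", as_index=False).agg(agg_spec)
print(per_user)
```

With the real data, score_cols would list num_25 through total_secs and member_cols the columns that came from members.csv. Note that groupby skips rows whose msno is NaN by default, and that mean ignores NaN values within a group.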


