How to determine each unique user's priority actions with pandas

Posted 2024-09-26 17:51:30


I have a DataFrame like this:

ID                                   address     used_at      active_seconds    pageviews
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2014-03-17 00:24:47   148   3
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2014-03-17 01:08:29   34    4
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2014-05-02 17:47:39   22    1
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2015-01-03 01:37:05   224   5
bcb0cc3d7f01dc6297f1331362a0fc09    e1.ru       2015-01-11 03:49:50   54    1
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2015-03-10 22:11:01   26    7
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2015-03-25 03:02:07   22    4
690ef4613fd977f9c29e1124b9d5814c    avito.ru    2014-02-05 09:25:56   6     3
690ef4613fd977f9c29e1124b9d5814c    avito.ru    2014-03-18 11:27:49   244   14

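The full file is available at file_with_data. I need to print how users' site priorities changed between 2014 and 2015. In other words, I need to count which site they visited first and which site they moved to afterwards.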

I think I need a loop:

infile = pd.read_csv("avito_trend.csv", parse_dates=[2])
for id in infile['ID'].nunique():

But it doesn't work. How can I iterate over all the unique IDs and get information about their visits?
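The loop in the question fails because `nunique()` returns a single integer (the number of distinct values), which cannot be iterated. As a minimal sketch, here are two ways to visit each user; the tiny frame below is made up for illustration and only reuses the question's column names.

```python
import pandas as pd

# Made-up sample with the question's column names (values are illustrative).
df = pd.DataFrame({
    "ID": ["u1", "u1", "u2"],
    "address": ["avito.ru", "e1.ru", "avito.ru"],
    "used_at": pd.to_datetime(
        ["2014-03-17 00:24:47", "2015-01-11 03:49:50", "2014-02-05 09:25:56"]
    ),
})

# nunique() is a count (an int), so it cannot be looped over.
n_users = df["ID"].nunique()

# Option 1: unique() returns the distinct values themselves.
ids = list(df["ID"].unique())

# Option 2: groupby yields (ID, sub-frame) pairs, one per user.
visits_per_user = {}
for user_id, visits in df.groupby("ID"):
    # Sites in chronological order for this user.
    visits_per_user[user_id] = visits.sort_values("used_at")["address"].tolist()
```

For large files an explicit Python loop is usually slower than a vectorized `groupby` operation, so the loop is mainly useful for inspecting a few users.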


1 answer
User
#1 · Posted 2024-09-26 17:51:30

You can do it like this:

import pandas as pd

cols = ['ID', 'address', 'used_at']
df = pd.read_csv(r'avito_trend.csv', parse_dates=['used_at'], usecols=cols)

# sort DF by ID, Timestamp, address
df.sort_values(['ID','used_at','address'], inplace=True)

# adding helper columns: 'prev_address' and 'time_diff'
df['prev_address'] = df['address'].shift()
df['time_diff'] = df['used_at'] - df['used_at'].shift()

# exclude those where  `address` == `prev_address`
df = df[df['address'] != df['prev_address']]
# exclude those with `time_diff` > 10 minutes (please set desired value)
df = df[df['time_diff'] <= pd.Timedelta('10min')]

# group by (address, prev_address, df.used_at.dt.year) and count results
df[['ID','address','prev_address']] \
  .groupby(['address','prev_address', df.used_at.dt.year]) \
  .count() \
  .reset_index()

Note: pay attention to pd.Timedelta('10min'); you may need to adjust this time delta.
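One thing to watch in the snippet above: a plain `shift()` runs over the whole sorted frame, so the first row of each user is compared with the last row of the previous user, and a negative time difference there would still pass the `<= 10min` filter. The following sketch is a variant, not the answer's original code, that shifts within each `ID` and takes the threshold as a parameter. The function name `count_transitions`, the `year` and `count` column names, and the sample data are all hypothetical.

```python
import pandas as pd

def count_transitions(df, max_gap="10min"):
    """Count site-to-site moves per year without crossing user boundaries."""
    df = df.sort_values(["ID", "used_at"]).copy()
    by_user = df.groupby("ID")
    # Shift inside each user's rows; a user's first row gets NaN / NaT.
    df["prev_address"] = by_user["address"].shift()
    df["time_diff"] = df["used_at"] - by_user["used_at"].shift()
    # Keep rows where the site changed and the gap is short enough.
    moved = df["prev_address"].notna() & (df["address"] != df["prev_address"])
    quick = df["time_diff"] <= pd.Timedelta(max_gap)
    out = df[moved & quick]
    year = out["used_at"].dt.year.rename("year")
    return (
        out.groupby(["address", "prev_address", year])
        .size()
        .reset_index(name="count")
    )

# Made-up sample: u1 moves avito.ru -> e1.ru within 5 minutes in 2014,
# u2 moves e1.ru -> avito.ru within 3 minutes in 2015.
sample = pd.DataFrame({
    "ID": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "address": ["avito.ru", "e1.ru", "avito.ru", "e1.ru", "e1.ru", "avito.ru"],
    "used_at": pd.to_datetime([
        "2014-03-17 00:00:00", "2014-03-17 00:05:00", "2014-03-17 02:00:00",
        "2014-03-16 23:00:00", "2015-01-01 10:00:00", "2015-01-01 10:03:00",
    ]),
})
result = count_transitions(sample)
```

Counts from this variant can differ from the output shown in the answer, since rows that pair two different users are no longer included.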

Update: added year to groupby()

In [15]: df[['ID','address','prev_address']].groupby(['address','prev_address', df.used_at.dt.year]).count().reset_index()
Out[15]:
         address       prev_address  used_at    ID
0          am.ru            auto.ru     2014   103
1          am.ru            auto.ru     2015   135
2          am.ru           avito.ru     2014   133
3          am.ru           avito.ru     2015    31
4          am.ru      avtomarket.ru     2014    14
5          am.ru      avtomarket.ru     2015     6
6          am.ru  cars.mail.ru/sale     2014    17
7          am.ru  cars.mail.ru/sale     2015     8
8          am.ru            drom.ru     2014    65
9          am.ru            drom.ru     2015    29
10         am.ru              e1.ru     2014    33
11         am.ru              e1.ru     2015    17
12         am.ru        irr.ru/cars     2014    26
13         am.ru        irr.ru/cars     2015    20
14       auto.ru              am.ru     2014    86
15       auto.ru              am.ru     2015    77
16       auto.ru           avito.ru     2014  1316
17       auto.ru           avito.ru     2015  1052
18       auto.ru      avtomarket.ru     2014    39
19       auto.ru      avtomarket.ru     2015    32
20       auto.ru  cars.mail.ru/sale     2014    94
21       auto.ru  cars.mail.ru/sale     2015    31
22       auto.ru            drom.ru     2014   219
23       auto.ru            drom.ru     2015   205
24       auto.ru              e1.ru     2014   174
25       auto.ru              e1.ru     2015    84
26       auto.ru        irr.ru/cars     2014    89
27       auto.ru        irr.ru/cars     2015    41
28      avito.ru              am.ru     2014   109
29      avito.ru              am.ru     2015    49
..           ...                ...      ...   ...

Update 2:

If you want to combine the two columns into one:

new = df[['ID','address','prev_address']].groupby(['address','prev_address', df.used_at.dt.year]).count().reset_index()

new['visit'] = new['prev_address'] + ' -> ' + new['address']
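To compare the two years side by side, which is what the question asks about, the long table can be pivoted so that each year becomes a column. This is only a sketch: it uses four rows copied from the output above, and the names `counts`, `by_year`, and `change` are illustrative additions.

```python
import pandas as pd

# Four rows copied from the grouped output above.
counts = pd.DataFrame({
    "address": ["am.ru", "am.ru", "am.ru", "am.ru"],
    "prev_address": ["auto.ru", "auto.ru", "avito.ru", "avito.ru"],
    "used_at": [2014, 2015, 2014, 2015],  # the year level from groupby
    "ID": [103, 135, 133, 31],            # the count column
})
counts["visit"] = counts["prev_address"] + " -> " + counts["address"]

# One row per transition, one column per year; missing pairs become 0.
by_year = (
    counts.pivot(index="visit", columns="used_at", values="ID")
    .fillna(0)
    .astype(int)
)
# Positive means the transition became more common in 2015.
by_year["change"] = by_year[2015] - by_year[2014]
```

Sorting `by_year` on `change` would then show which transitions grew or shrank the most between the two years.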
