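I have created the DataFrame below, which lists the pages that users visited, sorted in ascending order of visit date. There are 5 pages: BLQ2_1 through BLQ2_5.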
user_id created_at PAGE
72672 2017-02-20 BLQ2_1
72672 2017-03-03 BLQ2_5
72672 2017-03-03 BLQ2_3
72672 2017-03-05 BLQ2_4
12370 2017-03-06 BLQ2_4
12370 2017-03-06 BLQ2_5
12370 2017-03-06 BLQ2_3
94822 2017-03-06 BLQ2_2
94822 2017-03-10 BLQ2_4
94822 2017-03-10 BLQ2_5
94822 2017-02-24 BLQ2_4
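For each page, I want statistics, across all users, on which page was visited immediately before it. That is, I need to compute per-page statistics such as: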
Path to BLQ2_5 is: 2 times from BLQ2_4 and 1 time from BLQ2_1.
Path to BLQ2_3 is: 2 times from BLQ2_5 and 1 time from BLQ2_4.
Path to BLQ2_4 is: 1 time from BLQ2_5, 1 time from BLQ2_3, 1 time from BLQ2_2, and 1 time from nowhere.
Do I have to use a loop, or is there a way to take advantage of pandas' groupby functionality? Any suggestions?
Here is my solution using a for loop:
pg_BLQ2_5 = pd.DataFrame()
pg_BLQ2_4 = pd.DataFrame()
pg_BLQ2_3 = pd.DataFrame()
pg_BLQ2_2 = pd.DataFrame()
pg_BLQ2_1 = pd.DataFrame()
first_pages = pd.DataFrame()
for user_id in df['user_id'].unique():
    # get only current user's records, and reset index
    _pg = df[df['user_id'] == user_id].reset_index()
    _pg.drop('index', axis=1, inplace=True)
    # if this is the first page visited, treat differently
    first_page = _pg.iloc[0]
    first_pages = first_pages.append(first_page)
    # exclude the first page visited from the dataframe
    _pg = _pg.loc[1:].reset_index()
    _pg.drop('index', axis=1, inplace=True)
    # for each page, get the record from its previous index, and build the dataframe.
    pg_BLQ2_5 = pg_BLQ2_5.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_5'].index -1])
    pg_BLQ2_4 = pg_BLQ2_4.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_4'].index -1])
    pg_BLQ2_3 = pg_BLQ2_3.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_3'].index -1])
    pg_BLQ2_2 = pg_BLQ2_2.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_2'].index -1])
    pg_BLQ2_1 = pg_BLQ2_1.append(_pg.iloc[_pg[_pg['PAGE'] == 'BLQ2_1'].index -1])
First, create a column showing the previous page (this assumes the DataFrame is sorted by user and then by date).
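A minimal sketch of that step, assuming the question's sample data and a per-user `shift`. The column name `prev_page` is my own choice, not something from the question. The rows are used in the order shown in the question, with no re-sorting:

```python
import pandas as pd

# Sample data copied from the question, in the order shown there.
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "created_at": pd.to_datetime([
        "2017-02-20", "2017-03-03", "2017-03-03", "2017-03-05",
        "2017-03-06", "2017-03-06", "2017-03-06",
        "2017-03-06", "2017-03-10", "2017-03-10", "2017-02-24",
    ]),
    "PAGE": ["BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
             "BLQ2_4", "BLQ2_5", "BLQ2_3",
             "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4"],
})

# Shift PAGE down by one row within each user, so every row sees the page
# that same user visited just before. A user's first visit gets NaN.
# "prev_page" is a hypothetical column name chosen for this sketch.
df["prev_page"] = df.groupby("user_id")["PAGE"].shift()
```

Because the shift happens inside each `user_id` group, one user's last page never leaks into the next user's first row.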
Then simply use groupby to count the values; you can also use unstack to reshape the result.
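A sketch of the counting and reshaping step on the question's sample data. The column name `prev_page` and the `"nowhere"` placeholder for a user's first visit are assumptions made for illustration; the placeholder is there because `groupby` drops NaN keys by default:

```python
import pandas as pd

# Sample data from the question (dates omitted; row order as shown there).
df = pd.DataFrame({
    "user_id": [72672] * 4 + [12370] * 3 + [94822] * 4,
    "PAGE": ["BLQ2_1", "BLQ2_5", "BLQ2_3", "BLQ2_4",
             "BLQ2_4", "BLQ2_5", "BLQ2_3",
             "BLQ2_2", "BLQ2_4", "BLQ2_5", "BLQ2_4"],
})

# Previous page per user; first visits are labelled "nowhere" (an assumed
# placeholder) so that they still show up in the counts.
df["prev_page"] = df.groupby("user_id")["PAGE"].shift().fillna("nowhere")

# Number of times each (page, previous page) pair occurs.
counts = df.groupby(["PAGE", "prev_page"]).size()

# One row per page, one column per previous page, 0 where a pair never occurs.
table = counts.unstack(fill_value=0)
```

Here `counts` is a Series with a two-level index, so `counts.loc[("BLQ2_5", "BLQ2_4")]` gives the number of arrivals at BLQ2_5 from BLQ2_4, and `table` holds the same numbers as a page-by-previous-page matrix.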