<p>Let's do a self-join on <code>user_id</code> with <code>merge</code>, then count the resulting slug pairs with <code>pd.crosstab</code>:</p>
<pre><code>import pandas as pd
from io import StringIO
txt = StringIO("""user_id page_view_page_slug
1 slug1
1 slug2
1 slug3
1 slug4
2 slug5
2 slug3
2 slug2
2 slug1""")
# The sample text is separated by single spaces, so split on any whitespace run
df = pd.read_csv(txt, sep=r'\s+')
dfm = df.merge(df, on='user_id')
df_out = pd.crosstab(dfm['page_view_page_slug_x'], dfm['page_view_page_slug_y'])
df_out
</code></pre>
<p>Output:</p>
<pre><code>page_view_page_slug_y  slug1  slug2  slug3  slug4  slug5
page_view_page_slug_x
slug1                      2      2      2      1      1
slug2                      2      2      2      1      1
slug3                      2      2      2      1      1
slug4                      1      1      1      1      0
slug5                      1      1      1      0      1
</code></pre>
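<p>Each cell counts the users who viewed both slugs, and the diagonal counts the users who viewed that slug at all. As a cross-check, here is a minimal sketch of an alternative that is not part of the approach above: build a user-by-slug count table and multiply it by its own transpose. The names <code>views</code>, <code>co_matrix</code> and <code>co_join</code> are illustrative, and the sample data is rebuilt directly so the snippet runs on its own.</p>

```python
import pandas as pd

# Same sample data as above, built directly instead of parsed from text.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "page_view_page_slug": ["slug1", "slug2", "slug3", "slug4",
                            "slug5", "slug3", "slug2", "slug1"],
})

# Rows are users, columns are slugs, cells are view counts (0 or 1 here).
views = pd.crosstab(df["user_id"], df["page_view_page_slug"])

# (slug x user) @ (user x slug): for each slug pair, sum over users.
co_matrix = views.T @ views

# The self-join result, for comparison.
dfm = df.merge(df, on="user_id")
co_join = pd.crosstab(dfm["page_view_page_slug_x"], dfm["page_view_page_slug_y"])

# Both approaches should agree cell by cell.
print((co_matrix.to_numpy() == co_join.to_numpy()).all())
```

<p>The matrix product avoids materializing the joined frame, which may matter when individual users have many page views.</p>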
<p>For data with duplicate rows (the same user viewing the same slug more than once), let's try:</p>
<pre><code># Number repeated (user_id, slug) rows: 0 for the first view, 1 for the second, ...
dfi = df.assign(v_count=df.groupby(['user_id', 'page_view_page_slug']).cumcount())
# After the self-join, keep every cross-slug pair, but keep a same-slug pair
# only when both sides are the same occurrence (equal v_count)
dfi = dfi.merge(dfi, on=['user_id'])\
         .query('page_view_page_slug_x != page_view_page_slug_y or (page_view_page_slug_x == page_view_page_slug_y and v_count_x == v_count_y)')
df_out = pd.crosstab(dfi['page_view_page_slug_x'], dfi['page_view_page_slug_y'])
df_out
</code></pre>
<p>Output (these counts correspond to input in which user 1 has viewed <code>slug1</code> twice):</p>
<pre><code>page_view_page_slug_y  slug1  slug2  slug3  slug4  slug5
page_view_page_slug_x
slug1                      3      3      3      2      1
slug2                      3      2      2      1      1
slug3                      3      2      2      1      1
slug4                      2      1      1      1      0
slug5                      1      1      1      0      1
</code></pre>
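<p>To make the duplicate case reproducible, here is a self-contained sketch. It assumes the sample data gains one repeated row (user 1 viewing <code>slug1</code> a second time); that row is an assumption, chosen because it yields the counts shown above. Boolean masks stand in for <code>query</code> here, and the names <code>pairs</code>, <code>different_slug</code> and <code>same_occurrence</code> are illustrative.</p>

```python
import pandas as pd

# Sample data plus one ASSUMED repeated row: user 1 views slug1 twice.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "page_view_page_slug": ["slug1", "slug1", "slug2", "slug3", "slug4",
                            "slug5", "slug3", "slug2", "slug1"],
})

# Number each repeat of a (user, slug) pair: 0 for the first view, 1 for the next, ...
dfi = df.assign(v_count=df.groupby(["user_id", "page_view_page_slug"]).cumcount())
pairs = dfi.merge(dfi, on="user_id")

# Keep every cross-slug pair; keep a same-slug pair only when both sides
# are the same occurrence, so a repeated view is not paired with its twin.
different_slug = pairs["page_view_page_slug_x"] != pairs["page_view_page_slug_y"]
same_occurrence = pairs["v_count_x"] == pairs["v_count_y"]
pairs = pairs[different_slug | same_occurrence]

df_out = pd.crosstab(pairs["page_view_page_slug_x"], pairs["page_view_page_slug_y"])
print(df_out)
```

<p>Without the filter, the <code>slug1</code>/<code>slug1</code> cell for this input would be 5 rather than 3, because user 1's two views would pair with each other in both directions.</p>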