Answers to: counting the number of unique co-authors in a DataFrame


1 answer

Anonymous, 1 day ago


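Another approach

First, <code>groupby</code> on <code>BookID</code> and <a href="https://stackoverflow.com/a/53088007/4057186">list all authors per book (i.e. list all authors per group)</a>

<pre><code>import pandas as pd

# df has one row per (BookID, Author) pair
combos = df.groupby('BookID').agg(lambda x: list(x)).reset_index(drop=False)
print(combos)

   BookID               Author
0       1  [John, Alex, Jenna]
1       2         [John, Alex]
2       3               [John]
3       4    [Alex, Mary, Max]
</code></pre>

Next, merge with the main data on <code>BookID</code>, so that every author row carries the full author list of its book

<pre><code>merged = combos.merge(df, how='inner', on='BookID')
print(merged)

   BookID             Author_x Author_y
0       1  [John, Alex, Jenna]     John
1       1  [John, Alex, Jenna]     Alex
2       1  [John, Alex, Jenna]    Jenna
3       2         [John, Alex]     John
4       2         [John, Alex]     Alex
5       3               [John]     John
6       4    [Alex, Mary, Max]     Alex
7       4    [Alex, Mary, Max]     Mary
8       4    [Alex, Mary, Max]      Max
</code></pre>

<code>Author_x</code> is the full list of authors for the book and includes <code>Author_y</code>. The full author list (<code>Author_x</code>) can now be compared against each individual/unique author (<code>Author_y</code>) as follows

<ol>
<li><a href="https://stackoverflow.com/a/3869503/4057186">Create a dict whose keys are the unique author values</a> (i.e. the unique authors) and whose values are empty lists</li>
<li>Iterate over each key-value pair in the dict</li>
<li>Slice the merged DataFrame from the step above using the <code>Author_y</code> column; this gives all the author lists for the author in the dict key</li>
<li>From the slice, get all the author lists (<code>Author_x</code>) as one flattened list</li>
<li><a href="https://stackoverflow.com/a/252711/4057186">Extend the blank list</a> with the <a href="https://stackoverflow.com/a/3462160/4057186">difference between the flattened list (all authors) and the dict key</a></li>
</ol>

<pre><code>d = {auth: [] for auth in df['Author'].unique()}
for k, v in d.items():
    # All author lists of the books that author k contributed to
    all_auths = merged[merged['Author_y'] == k]['Author_x'].values.tolist()
    # Flatten the list of lists
    auths = [coauths for nested in all_auths for coauths in nested]
    # Unique co-authors = all authors minus k itself
    v.extend(list(set(auths) - set([k])))
</code></pre>

Finally, put the dict into a <code>DataFrame</code> and count the non-null values per row. The number of co-author columns is taken from the longest co-author list, since <code>from_dict</code> raises an error when the number of column names does not match the data. The order of co-authors within a row can differ between runs because it comes from a <code>set</code>.

<pre><code># One column per co-author slot, sized by the longest co-author list
n_cols = max(len(v) for v in d.values())
cnames = ['coauth' + str(k) for k in range(1, n_cols + 1)]
df_summary = pd.DataFrame.from_dict(d, orient='index', columns=cnames)
# Non-null cells per row = number of unique co-authors
df_summary['Num_Unique_CoAuthors'] = df_summary.shape[1] - df_summary.isna().sum(axis=1)
# Move the author names from the index into an 'author' column
df_summary = df_summary.rename_axis('author').reset_index()
print(df_summary)

  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John    Alex   Jenna    None    None                     2
1   Alex     Max    John    Mary   Jenna                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Alex    Mary    None    None                     2
</code></pre>

Extended data case

If the main data contains an author with no co-authors at all, this approach prints zero for that row. Below is the data with a dummy row added that has only one author

<pre><code>print(df)

   BookID Author
0       1   John
1       1   Alex
2       1  Jenna
3       2   John
4       2   Alex
5       3   John
6       4   Alex
7       4   Mary
8       4    Max
9       5    Tom
</code></pre>

This is the output

<pre><code>  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John   Jenna    Alex    None    None                     2
1   Alex    Mary    John   Jenna     Max                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Mary    Alex    None    None                     2
5    Tom    None    None    None    None                     0
</code></pre>

Initial answer

Have you tried <code>groupby</code> with a <code>sum</code> aggregation?

<pre><code>df.groupby(['Author'])['BookID'].sum()
</code></pre>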
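If only the counts are needed, and not the co-author names in separate columns, a self-merge may be a shorter route. The sketch below is a swapped-in technique, not the method from the answer above. It assumes <code>df</code> has one row per (<code>BookID</code>, <code>Author</code>) pair, as in the extended data case. The helper name <code>count_unique_coauthors</code> and the <code>Author_co</code> column are illustrative names introduced here.

```python
import pandas as pd

# Sample data copied from the extended data case (Tom has no co-authors)
df = pd.DataFrame({
    "BookID": [1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
    "Author": ["John", "Alex", "Jenna", "John", "Alex",
               "John", "Alex", "Mary", "Max", "Tom"],
})


def count_unique_coauthors(df):
    """Hypothetical helper: number of distinct co-authors per author."""
    # Pair every author with every author of the same book (self-merge on BookID)
    pairs = df.merge(df, on="BookID", suffixes=("", "_co"))
    # Drop the rows where an author is paired with themselves
    pairs = pairs[pairs["Author"] != pairs["Author_co"]]
    # Count distinct co-authors; authors without any are absent at this point
    counts = pairs.groupby("Author")["Author_co"].nunique()
    # Reindex over all authors so that solo authors get an explicit 0
    counts = counts.reindex(df["Author"].unique(), fill_value=0)
    return counts.rename("Num_Unique_CoAuthors")


result = count_unique_coauthors(df)
print(result)
```

The <code>reindex</code> step is what keeps solo authors such as Tom in the result. Without it they would disappear, because the self-pair filter removes all of their rows.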
