Counting the number of unique co-authors in a DataFrame

3 answers

User

Answer 1 · edited 2024-06-25 22:58:39

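First build a set of authors per group in a new column, then take the difference from the Author column, remove the empty sets (the code does this by filtering on `df['new'].astype(bool)`), flatten the remaining values into a new set per author to get the unique co-authors, and finally take the length: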

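# Attach the set of all authors of each book to every row
df = df.join(df.groupby('BookID')['Author'].apply(set).rename('new'), 'BookID')

# Remove the row's own author, leaving only the co-authors
df['new'] = [b - {a} for a, b in zip(df['Author'], df['new'])]

# Keep rows with at least one co-author, then merge the sets per author
df = (df[df['new'].astype(bool)].groupby('Author')['new']
          .apply(lambda x: tuple(set(z for y in x for z in y)))
          .to_frame())

df.insert(0, 'Num_Unique_CoAuthors', df['new'].str.len())
print(df)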
        Num_Unique_CoAuthors                       new
Author                                                
Alex                       4  (Max, John, Jenna, Mary)
Jenna                      2              (John, Alex)
John                       2             (Jenna, Alex)
Mary                       2               (Max, Alex)
Max                        2              (Mary, Alex)
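
For anyone who wants to run this end to end, here is a minimal sketch of the same set-difference idea in a self-contained form. The sample frame is reconstructed from the data printed elsewhere in this thread, and the helper name `count_coauthors_by_sets` is hypothetical. Unlike the answer above, this variant keeps authors without co-authors (with a count of 0) instead of filtering them out.

```python
import pandas as pd

# Sample data reconstructed from the frame printed in this thread.
df = pd.DataFrame({
    'BookID': [1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Author': ['John', 'Alex', 'Jenna', 'John', 'Alex',
               'John', 'Alex', 'Mary', 'Max'],
})


def count_coauthors_by_sets(frame):
    """Hypothetical helper: number of distinct co-authors per author."""
    # Map each BookID to the set of its authors.
    book_sets = frame.groupby('BookID')['Author'].apply(set).to_dict()

    # Accumulate, per author, everyone else who shares a book with them.
    coauthors = {}
    for author, book in zip(frame['Author'], frame['BookID']):
        coauthors.setdefault(author, set()).update(book_sets[book] - {author})

    counts = {author: len(others) for author, others in coauthors.items()}
    return pd.Series(counts, name='Num_Unique_CoAuthors').sort_index()


counts = count_coauthors_by_sets(df)
print(counts)
```

Working with plain Python sets keeps the result independent of set ordering, which is why only the counts are returned here.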

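User

Answer 2 · edited 2024-06-25 22:58:39

I have another solution.

Self-join the data on BookID
Build an adjacency matrix with `pd.crosstab`
Count per row, excluding the row's own author (the matrix is symmetric, so the column-wise `apply` in the code below gives the same totals)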

>>> df_merge = df.merge(df, on='BookID')
>>> ctdf = pd.crosstab(df_merge.Author_x, df_merge.Author_y, aggfunc='max', values=[1] * len(df_merge)).fillna(0)
>>> ctdf
Author_y  Alex  Jenna  John  Mary  Max
Author_x
Alex       1.0    1.0   1.0   1.0  1.0
Jenna      1.0    1.0   1.0   0.0  0.0
John       1.0    1.0   1.0   0.0  0.0
Mary       1.0    0.0   0.0   1.0  1.0
Max        1.0    0.0   0.0   1.0  1.0
>>> ctdf.apply(lambda x: sum([*x]) - 1)
Author_y
Alex     4.0
Jenna    2.0
John     2.0
Mary     2.0
Max      2.0
dtype: float64
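
A possible variation on the adjacency-matrix idea, shown here as a sketch rather than as this answer's exact code: let `pd.crosstab` count co-occurrences, threshold them with `gt(0)`, and sum along the rows. This keeps the result as integers and avoids the `values`/`aggfunc` arguments. The sample frame is reconstructed from the data printed in this thread, and the variable names are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the frame printed in this thread.
df = pd.DataFrame({
    'BookID': [1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Author': ['John', 'Alex', 'Jenna', 'John', 'Alex',
               'John', 'Alex', 'Mary', 'Max'],
})

# Self-join: every author paired with every author of the same book.
df_merge = df.merge(df, on='BookID')

# Co-occurrence counts; a cell > 0 means the two authors share a book.
adjacency = pd.crosstab(df_merge['Author_x'], df_merge['Author_y']).gt(0)

# Each author is always adjacent to themselves, so subtract the diagonal.
coauthor_counts = adjacency.sum(axis=1) - 1
print(coauthor_counts)
```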

User

Answer 3 · edited 2024-06-25 22:58:39

Another approach

First, group by BookID and list all authors per book (i.e. list all authors per group)

combos = df.groupby('BookID').agg(lambda x: list(x)).reset_index(drop=False)
print(combos)
   BookID               Author
0       1  [John, Alex, Jenna]
1       2         [John, Alex]
2       3               [John]
3       4    [Alex, Mary, Max]

Next, merge with the main data on BookID to get, for each author, all the authors of the same book

merged = combos.merge(df, how='inner', on='BookID')
print(merged)
   BookID             Author_x Author_y
0       1  [John, Alex, Jenna]     John
1       1  [John, Alex, Jenna]     Alex
2       1  [John, Alex, Jenna]    Jenna
3       2         [John, Alex]     John
4       2         [John, Alex]     Alex
5       3               [John]     John
6       4    [Alex, Mary, Max]     Alex
7       4    [Alex, Mary, Max]     Mary
8       4    [Alex, Mary, Max]      Max

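Author_x is the full list of authors, including Author_y. The full author list (Author_x) can now be compared with each individual/unique author (Author_y) as follows

Create a dict whose keys are the unique `Author` values (i.e. the unique authors) and whose values are empty lists
Iterate over each key-value pair in the dict
Slice the merged DataFrame from the previous step on the Author_y column; this gives all authors associated with the author in the dict key
Take the Author_x lists from the slice and flatten them into a single list of authors
Extend the empty list with the difference between the flattened list (all authors) and the dict key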

d = {auth:[] for auth in df['Author'].unique()}
for k,v in d.items():
    all_auths = merged[merged['Author_y']==k]['Author_x'].values.tolist()
    auths = [coauths for nested in all_auths for coauths in nested]
    v.extend(list(set(auths) - set([k])))
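
As an aside, the same comparison can probably be done without an explicit loop by using `DataFrame.explode`. The following is a sketch under that assumption, not the code this answer uses; the sample frame is reconstructed from the data printed in this thread and the variable names `exploded` and `coauthors` are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the frame printed in this thread.
df = pd.DataFrame({
    'BookID': [1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Author': ['John', 'Alex', 'Jenna', 'John', 'Alex',
               'John', 'Alex', 'Mary', 'Max'],
})

# One row per book holding the list of its authors, merged back on BookID.
combos = df.groupby('BookID')['Author'].agg(list).reset_index()
merged = combos.merge(df, how='inner', on='BookID')

# Turn each list element in Author_x into its own row.
exploded = merged.explode('Author_x')

# Drop rows where the listed author is the row's own author.
exploded = exploded[exploded['Author_x'] != exploded['Author_y']]

# Collect the distinct co-authors per author (sorted for stable output).
coauthors = exploded.groupby('Author_y')['Author_x'].apply(
    lambda s: sorted(set(s)))
print(coauthors)
```

Authors who never share a book do not appear in `coauthors` at all, because every one of their rows is dropped by the filter.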

Finally, put the result into a DataFrame and count the non-null values in each row

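# Size the columns to the longest co-author list rather than to len(d) - 1
max_coauths = max(len(v) for v in d.values())
cnames = ['coauth' + str(k) for k in range(1, max_coauths + 1)]
df_summary = pd.DataFrame.from_dict(d, orient='index', columns=cnames)
# Count the non-null co-author cells in each row
df_summary['Num_Unique_CoAuthors'] = df_summary.notna().sum(axis=1)
df_summary = df_summary.rename_axis('author').reset_index()
print(df_summary)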
  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John    Alex   Jenna    None    None                     2
1   Alex     Max    John    Mary   Jenna                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Alex    Mary    None    None                     2

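Extended data case

If the main data contains a single author (i.e. one without any co-authors), this method prints zero for that row

Below is the data with a dummy row added that has only one author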

print(df)
   BookID Author
0       1   John
1       1   Alex
2       1  Jenna
3       2   John
4       2   Alex
5       3   John
6       4   Alex
7       4   Mary
8       4    Max
9       5    Tom

Here is the output

  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John   Jenna    Alex    None    None                     2
1   Alex    Mary    John   Jenna     Max                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Mary    Alex    None    None                     2
5    Tom    None    None    None    None                     0
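
If only the counts are needed, the zero for a solo author can also be obtained with a self-merge followed by `reindex`. This is a hedged sketch of an alternative, not the method used above: the frame mirrors the extended data printed in this answer, and the variable names are illustrative.

```python
import pandas as pd

# Extended data from this answer: Tom (BookID 5) has no co-authors.
df = pd.DataFrame({
    'BookID': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
    'Author': ['John', 'Alex', 'Jenna', 'John', 'Alex',
               'John', 'Alex', 'Mary', 'Max', 'Tom'],
})

# Pair every author with every author of the same book, minus self-pairs.
pairs = df.merge(df, on='BookID')
pairs = pairs[pairs['Author_x'] != pairs['Author_y']]

# Distinct partners per author; authors with no pairs are missing here,
# so reindex over all authors and fill the gaps with 0.
counts = (pairs.groupby('Author_x')['Author_y'].nunique()
               .reindex(df['Author'].unique(), fill_value=0)
               .rename_axis('Author')
               .rename('Num_Unique_CoAuthors'))
print(counts)
```

The `reindex` step is what keeps Tom in the result; without it he would be absent rather than reported as 0.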

Initial answer

Have you tried groupby with a sum aggregation?

df.groupby(['Author'])['BookID'].sum()
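
To make clear what that one-liner returns, here is a small sketch on the sample data (reconstructed from the frame printed in this thread). It adds up the BookID values per author, so it appears to be a different quantity from the number of co-authors.

```python
import pandas as pd

# Sample data reconstructed from the frame printed in this thread.
df = pd.DataFrame({
    'BookID': [1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Author': ['John', 'Alex', 'Jenna', 'John', 'Alex',
               'John', 'Alex', 'Mary', 'Max'],
})

# Sum of BookID per author: John wrote books 1, 2 and 3, so he gets 6.
id_sums = df.groupby(['Author'])['BookID'].sum()
print(id_sums)
```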
