Counting the number of common words between two columns in Python pandas

3 Answers

User

Answer 1 · edited 2024-05-15 20:13:45

A one-liner... because, why not? Either way, I came here to upvote MaxU's answer. I'll leave my own anyway.

df.join(
    df.applymap(lambda x: set(x.split())).pipe(
        lambda d: d.friend_definition - (d.friend_definition - d.friend_description)
    ).pipe(lambda s: pd.DataFrame(dict(word_overlap=s, overlap_count=s.str.len())))
)

  friend_description friend_definition  overlap_count word_overlap
0      James is dumb         dumb dude              1       {dumb}
1     Jacob is smart         smart guy              1      {smart}
2     Jane is pretty  she looks pretty              1     {pretty}
3      Susan is rich       she is rich              2   {rich, is}
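
The double set difference in the one-liner, `definition - (definition - description)`, is just another way of writing a set intersection. Here is a minimal sketch that checks this on the sample data. The DataFrame is rebuilt by hand from the output shown above, and the names `desc_sets`, `defn_sets`, `via_difference` and `via_intersection` are made up for illustration.

```python
import pandas as pd

# Sample data reconstructed from the output shown in this thread.
df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

# Turn every cell into a set of whitespace-separated words.
desc_sets = df["friend_description"].map(lambda s: set(s.split()))
defn_sets = df["friend_definition"].map(lambda s: set(s.split()))

# B - (B - A) keeps exactly the elements of B that are also in A ...
via_difference = [b - (b - a) for a, b in zip(desc_sets, defn_sets)]
# ... which is the same thing as the intersection A & B.
via_intersection = [a & b for a, b in zip(desc_sets, defn_sets)]

print(via_difference == via_intersection)
```

So the `&` operator used in the other answers should give the same sets, just spelled more directly.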

User

Answer 2 · edited 2024-05-15 20:13:45

For strings like these, a simple list comprehension seems to be the fastest approach:

In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

In [113]: df['overlap_count'] = df['word_overlap'].str.len()

In [114]: df
Out[114]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {rich, is}              2
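
For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same list-comprehension idea. It assumes the sample data shown above, and it zips the two columns by name instead of indexing `df.values` by position, so it does not depend on column order. The helper name `add_word_overlap` is hypothetical and not part of pandas.

```python
import pandas as pd


def add_word_overlap(df, left, right):
    """Return a copy of df with word_overlap and overlap_count columns.

    `left` and `right` are the names of two string columns.
    """
    out = df.copy()
    # Intersect the word sets of the two columns, row by row.
    out["word_overlap"] = [
        set(a.split()) & set(b.split())
        for a, b in zip(out[left], out[right])
    ]
    # Count the shared words in each row.
    out["overlap_count"] = out["word_overlap"].map(len)
    return out


# Sample data reconstructed from the output shown in this thread.
df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

result = add_word_overlap(df, "friend_description", "friend_definition")
print(result["overlap_count"].tolist())  # [1, 1, 1, 2]
```

Note that the split is on whitespace only and the comparison is case-sensitive, exactly as in the snippet above.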

A single apply(..., axis=1):

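df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)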

The chained apply().apply(..., axis=1) approach:

In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
    ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
    ...:                                axis=1))
    ...:

In [24]: df['overlap_count'] = df['word_overlap'].str.len()

In [25]: df
Out[25]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2

Timing against a 40,000-row DataFrame:

In [104]: df = pd.concat([df] * 10**4, ignore_index=True)

In [105]: df.shape
Out[105]: (40000, 2)

In [106]: %timeit [set(x[0].split()) & set(x[1].split()) for x in df.values]
223 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: %timeit df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)
3.65 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [108]: %timeit df.apply(lambda x: x.str.split(expand=False)).apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
     ...: axis=1)
4.63 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
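
If you are not in IPython, `%timeit` is not available, but the standard-library `timeit` module can do a rough version of the same comparison. This is only a sketch: the function names `list_comp` and `row_apply` are made up here, the frame is cut down to 4,000 rows so the script finishes quickly, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data reconstructed from the output shown in this thread.
small = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})
# 4,000 rows instead of 40,000, to keep the run short.
big = pd.concat([small] * 1000, ignore_index=True)


def list_comp(df):
    # Plain Python loop over the underlying NumPy array of rows.
    return [set(x[0].split()) & set(x[1].split()) for x in df.values]


def row_apply(df):
    # Row-wise apply; pandas builds a Series object for every row.
    return df.apply(
        lambda r: set(r["friend_description"].split())
        & set(r["friend_definition"].split()),
        axis=1,
    )


# Total wall-clock seconds for 3 calls of each approach.
t_list = timeit.timeit(lambda: list_comp(big), number=3)
t_apply = timeit.timeit(lambda: row_apply(big), number=3)
print(f"list comprehension: {t_list:.3f} s, apply(axis=1): {t_apply:.3f} s")
```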

User

Answer 3 · edited 2024-05-15 20:13:45

Easier to understand for mere mortals (like me)?

>>> import pandas as pd
>>> df = pd.read_csv('user98235.csv', sep='\t')
>>> def f(columns):
...     f_desc, f_def = columns[0], columns[1]
...     common = set(f_desc.split()).intersection(set(f_def.split()))
...     return common, len(common)
... 
>>> df[['word_overlap', 'overlap_count']] = df.apply(f, axis=1, raw=True).apply(pd.Series)
>>> df
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2
