Counting the number of shared words between two columns in Python pandas

Posted 2024-05-15 20:13:45


Suppose I have the following table in Python pandas:

friend_description  friend_definition
    James is dumb      dumb dude
    Jacob is smart     smart guy
    Jane is pretty     she looks pretty
    Susan is rich      she is rich

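Here, in the first row, the word "dumb" appears in both columns. In the second row, "smart" appears in both columns. In the third row, "pretty" appears in both columns, and in the last row, both "is" and "rich" appear in both columns. I would like to create the following columns: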

word_overlap  overlap_count
dumb          1
smart         1
pretty        1
is rich       2

I could build these new columns by hand with a for loop, but I was wondering whether pandas has a function that makes this kind of operation smoother.


3 answers

A one-liner... because, why not? Anyway, I came here to upvote MaxU's answer. I'll leave my own here all the same.

df.join(
    df.applymap(lambda x: set(x.split())).pipe(
        lambda d: d.friend_definition - (d.friend_definition - d.friend_description)
    ).pipe(lambda s: pd.DataFrame(dict(word_overlap=s, overlap_count=s.str.len())))
)

  friend_description friend_definition  overlap_count word_overlap
0      James is dumb         dumb dude              1       {dumb}
1     Jacob is smart         smart guy              1      {smart}
2     Jane is pretty  she looks pretty              1     {pretty}
3      Susan is rich       she is rich              2   {rich, is}
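
The one-liner above gets the shared words through a double set difference instead of an explicit intersection. As a small plain-Python sketch of why that works (the variable names `a` and `b` are only illustrative, and the strings are taken from the last row of the question's table):

```python
# Word sets for the last row of the example table.
a = set("Susan is rich".split())   # words in friend_description
b = set("she is rich".split())     # words in friend_definition

# b - a keeps the words that appear only in b;
# removing those from b leaves exactly the words the two sets share.
via_difference = b - (b - a)
via_intersection = a & b

print(via_difference == via_intersection)  # True
```

So `d.friend_definition - (d.friend_definition - d.friend_description)` should give the same sets as an element-wise `&` would.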

For strings like these, a simple list comprehension appears to be the fastest approach:

In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

In [113]: df['overlap_count'] = df['word_overlap'].str.len()

In [114]: df
Out[114]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {rich, is}              2
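
For anyone who wants to run this outside an IPython session, here is a self-contained sketch. It rebuilds the sample table from the question by hand, and as a small variation (not the code above) it pairs the two columns by name with `zip` rather than relying on their position in `df.values`. It also uses `map(len)` in place of `.str.len()`, which should give the same counts.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

# Split each cell on whitespace and intersect the two word sets row by row.
df["word_overlap"] = [
    set(desc.split()) & set(defn.split())
    for desc, defn in zip(df["friend_description"], df["friend_definition"])
]

# The size of each set is the number of shared words.
df["overlap_count"] = df["word_overlap"].map(len)

print(df)
```

Note that the comparison is case-sensitive and splits on whitespace only, so "Rich" and "rich," would not count as matches for "rich".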

A single apply(..., axis=1):

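df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)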

The apply().apply(..., axis=1) approach:

In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
    ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
    ...:                                axis=1))
    ...:

In [24]: df['overlap_count'] = df['word_overlap'].str.len()

In [25]: df
Out[25]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2

Timings against a 40,000-row DataFrame:

In [104]: df = pd.concat([df] * 10**4, ignore_index=True)

In [105]: df.shape
Out[105]: (40000, 2)

In [106]: %timeit [set(x[0].split()) & set(x[1].split()) for x in df.values]
223 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: %timeit df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)
3.65 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [108]: %timeit df.apply(lambda x: x.str.split(expand=False)).apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
     ...: axis=1)
4.63 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
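
If you want to reproduce the comparison without IPython's `%timeit`, something along these lines should work with the standard `timeit` module. This is only a rough sketch: the helper names `with_comprehension` and `with_apply` are made up for illustration, the frame is smaller (4,000 rows) so it finishes quickly, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to 4,000 rows.
base = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})
big = pd.concat([base] * 1000, ignore_index=True)


def with_comprehension(frame):
    # List comprehension over the raw values (column 0 and column 1).
    return [set(x[0].split()) & set(x[1].split()) for x in frame.values]


def with_apply(frame):
    # Row-wise apply, as in the single apply(..., axis=1) variant.
    return frame.apply(
        lambda r: set(r["friend_description"].split())
        & set(r["friend_definition"].split()),
        axis=1,
    ).tolist()


# Check that both approaches agree before timing them.
same = with_comprehension(big) == with_apply(big)

t_comp = timeit.timeit(lambda: with_comprehension(big), number=3)
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
print(f"same result: {same}")
print(f"list comprehension: {t_comp:.3f} s, apply(axis=1): {t_apply:.3f} s (3 runs each)")
```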

Perhaps easier to understand for mere mortals (like me)?

>>> import pandas as pd
>>> df = pd.read_csv('user98235.csv', sep='\t')
>>> def f(columns):
...     f_desc, f_def = columns[0], columns[1]
...     common = set(f_desc.split()).intersection(set(f_def.split()))
...     return common, len(common)
... 
>>> df[['word_overlap', 'overlap_count']] = df.apply(f, axis=1, raw=True).apply(pd.Series)
>>> df
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2
