Pandas: fastest way to count string co-occurrences in rows (words in sentences)

Posted 2024-09-30 14:27:28


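Suppose one column of a data frame contains sentences (words separated by commas), and I also have combinations of words (also comma-separated). Words are never repeated, neither within a sentence nor within a combination. I need to count the number of sentences in which each combination of words appears, regardless of word order.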
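To make the goal concrete, here is a minimal sketch on made-up data (the three sentences and three combinations are hypothetical, not taken from the real column). A combination counts as appearing in a sentence when every one of its words is present, in any order:

```python
# Hypothetical toy data, only for illustrating what is being counted.
sentences = ["a,b,c", "b,a", "c,d"]
combinations = ["a,b", "c", "d,a"]

sentence_sets = [set(s.split(",")) for s in sentences]

counts = {}
for comb in combinations:
    words = set(comb.split(","))
    # A sentence matches when it contains every word of the combination,
    # regardless of the order in which the words appear.
    counts[comb] = sum(words <= s for s in sentence_sets)

print(counts)  # {'a,b': 2, 'c': 2, 'd,a': 0}
```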

I have a pandas dataframe column that looks like this:

df.sentences =

0                         GO:0002576,GO:0008150,GO:0043312
1        GO:0001869,GO:0002576,GO:0007597,GO:0010466,GO...
2        GO:0006400,GO:0006412,GO:0006418,GO:0006419,GO...
3        GO:0007416,GO:0030036,GO:0030097,GO:0032092,GO...
4        GO:0002407,GO:0006816,GO:0006874,GO:0006887,GO...
                               ...                        
14503          GO:0002221,GO:0002223,GO:0002376,GO:0045087
14504                                GO:0003351,GO:0048240
14505    GO:0001889,GO:0006351,GO:0006355,GO:0006357,GO...
14506    GO:0006892,GO:0007596,GO:0008089,GO:0016081,GO...
14507    GO:0000209,GO:0007030,GO:0008283,GO:0016567,GO...
Name: annots, Length: 14508, dtype: object

Then there is a list of distinct combinations of strings derived from the column above:

combinations =

0                                                  GO:0007165
1                                                  GO:0007186
2                                                  GO:0007155
3                                                  GO:0006954
4                                                  GO:0019221
                                  ...                        
16778101    GO:0000165,GO:0000209,GO:0002223,GO:0006521,GO...
16778102    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778103    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778104    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778105    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
Name: itemsets, Length: 16778106, dtype: object

What I have so far:

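# Turn every sentence and every combination into a set of words
setsentences = [set(sentence.split(',')) for sentence in df.sentences]
combinations = [set(comb.split(',')) for comb in combinations]

# For each combination, count the sentences that contain all of its words
sentenceCount = {}
for comb in combinations:
    sentenceCount[','.join(comb)] = sum(comb.issubset(sentence) for sentence in setsentences)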

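The problem here (IMO) is the loop of 16778106 iterations (one per combination), each of which scans every sentence. Is there a way to use apply or map to count the sentences faster? Maybe by converting the words to numbers? Or by using regular expressions?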
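One way to read "converting the words to numbers" is to give each distinct word a bit position and encode every sentence and combination as a single integer. The subset test then becomes one bitwise AND. The sketch below is only an illustration under that assumption; `build_vocab`, `to_mask` and `count_matches` are hypothetical helper names, and the number of comparisons stays the same as in the original loop, only each comparison gets cheaper.

```python
def build_vocab(sentences):
    """Assign each distinct word a bit position (hypothetical helper)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split(","):
            vocab.setdefault(word, len(vocab))
    return vocab


def to_mask(text, vocab):
    """Encode a comma-separated string as an int with one bit per word.

    Returns None when a word does not occur in any sentence, because such
    a combination cannot match anything.
    """
    mask = 0
    for word in text.split(","):
        bit = vocab.get(word)
        if bit is None:
            return None
        mask |= 1 << bit
    return mask


def count_matches(sentences, combinations):
    """Count, for each combination, the sentences containing all its words."""
    vocab = build_vocab(sentences)
    sentence_masks = [to_mask(s, vocab) for s in sentences]
    counts = {}
    for comb in combinations:
        m = to_mask(comb, vocab)
        if m is None:
            counts[comb] = 0
        else:
            # comb is a subset of the sentence when all of its bits are set.
            counts[comb] = sum(1 for s in sentence_masks if (s & m) == m)
    return counts


result = count_matches(["a,b,c", "b,a", "c,d"], ["a,b", "c", "d,a", "z"])
```

Python integers have arbitrary precision, so the encoding still works when the vocabulary has thousands of distinct words, although the masks get correspondingly wider.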

I hope I have explained myself well enough. Thank you in advance for your time.

To create a test example:

import random
import string

# Each character stands in for one "word"
chars = string.ascii_uppercase + string.ascii_lowercase + string.digits

# Random sentences: comma-separated samples of distinct characters
nsentences = 18000
sentences = [','.join(random.sample(chars, random.randint(1, len(chars)))) for _ in range(nsentences)]

# Random combinations, built the same way
ncombinations = 1000000
combinations = [','.join(random.sample(chars, random.randint(1, len(chars)))) for _ in range(ncombinations)]

setsentences = [set(sentence.split(',')) for sentence in sentences]
combinations = [set(comb.split(',')) for comb in combinations]

# Same brute-force count as above
sentenceCount = {}
for comb in combinations:
    sentenceCount[','.join(comb)] = sum(comb.issubset(sentence) for sentence in setsentences)
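
A possible alternative to scanning every sentence for every combination is an inverted index: map each word to the set of sentence indices containing it, then intersect those sets for the words of a combination. The sketch below is not a benchmarked solution, just one way this could look. `count_with_inverted_index` is a hypothetical name, and the data is a scaled-down variant of the test example above (200 sentences, 50 short combinations, fixed seed), which is an assumption made to keep it quick to run.

```python
import random
import string
from collections import defaultdict


def count_with_inverted_index(sentences, combinations):
    """Count matching sentences per combination via a word -> rows index."""
    # Map each word to the set of sentence indices that contain it.
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split(","):
            index[word].add(i)

    counts = {}
    for comb in combinations:
        # Sort by size so the intersection starts from the rarest word.
        postings = sorted(
            (index.get(word, set()) for word in comb.split(",")), key=len
        )
        counts[comb] = len(set.intersection(*postings))
    return counts


# Scaled-down test data in the spirit of the example above (assumed sizes).
random.seed(0)
chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
sentences = [
    ",".join(random.sample(chars, random.randint(1, len(chars))))
    for _ in range(200)
]
combinations = [
    ",".join(random.sample(chars, random.randint(1, 5))) for _ in range(50)
]

counts = count_with_inverted_index(sentences, combinations)

# Cross-check against the brute-force subset test.
sentence_sets = [set(s.split(",")) for s in sentences]
brute = {
    comb: sum(set(comb.split(",")) <= s for s in sentence_sets)
    for comb in combinations
}
print(counts == brute)
```

The index is built once, so the per-combination cost depends on how many sentences contain its rarest word rather than on the total number of sentences.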

