Fastest way in pandas to count co-occurrences of strings in rows (words in sentences)

Posted 2024-09-30 14:27:28


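Suppose one column of a DataFrame contains sentences (comma-separated words), and I also have combinations of words (also comma-separated). Words are never repeated within a sentence or within a combination. I need to count, for each combination, the number of sentences that contain all of its words, regardless of order.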
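As a minimal illustration of the matching rule described above, a combination counts for a sentence when every one of its words occurs in that sentence, in any order. The two strings below are made up for the example (they reuse identifiers in the same format as the data); the names `sentence`, `comb` and `other` are only illustrative.

```python
# Hypothetical sentence and combinations, written in the same comma-separated format
sentence = set("GO:0002576,GO:0008150,GO:0043312".split(','))
comb = set("GO:0043312,GO:0002576".split(','))    # same words, different order
other = set("GO:0002576,GO:0007165".split(','))   # second word is not in the sentence

# Order does not matter because both sides are sets
matches = comb.issubset(sentence)    # every word of comb is in the sentence
missing = other.issubset(sentence)   # at least one word of other is absent
```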

I have a pandas DataFrame column that looks like this:

df.sentences =

0                         GO:0002576,GO:0008150,GO:0043312
1        GO:0001869,GO:0002576,GO:0007597,GO:0010466,GO...
2        GO:0006400,GO:0006412,GO:0006418,GO:0006419,GO...
3        GO:0007416,GO:0030036,GO:0030097,GO:0032092,GO...
4        GO:0002407,GO:0006816,GO:0006874,GO:0006887,GO...
                               ...                        
14503          GO:0002221,GO:0002223,GO:0002376,GO:0045087
14504                                GO:0003351,GO:0048240
14505    GO:0001889,GO:0006351,GO:0006355,GO:0006357,GO...
14506    GO:0006892,GO:0007596,GO:0008089,GO:0016081,GO...
14507    GO:0000209,GO:0007030,GO:0008283,GO:0016567,GO...
Name: annots, Length: 14508, dtype: object

Then there is a list of the distinct combinations of strings derived from the column above: combinations =

0                                                  GO:0007165
1                                                  GO:0007186
2                                                  GO:0007155
3                                                  GO:0006954
4                                                  GO:0019221
                                  ...                        
16778101    GO:0000165,GO:0000209,GO:0002223,GO:0006521,GO...
16778102    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778103    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778104    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778105    GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
Name: itemsets, Length: 16778106, dtype: object

What I have so far:

# Turn each comma-separated string into a set of words
setsentences = [set(sentence.split(',')) for sentence in df.sentences]
combinations = [set(comb.split(',')) for comb in combinations]

# For each combination, count the sentences that contain all of its words
sentenceCount = {}
for comb in combinations:
    sentenceCount[','.join(comb)] = sum([comb.issubset(sentence) for sentence in setsentences])

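The problem here (IMO) is the loop of 16,778,106 iterations, each of which scans every sentence. Is there a way to use apply or map to count the sentences quickly? Maybe by converting the words to numbers? Or by using regular expressions?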
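One way to read the "convert the words to numbers" idea is an inverted index built from Python integers used as bitsets: each word maps to an integer whose bit i is set when sentence i contains the word, and a combination's count is the number of set bits left after AND-ing the masks of its words. The following is only a sketch under that assumption, not a benchmarked solution; the function name `count_with_bitmasks` and its arguments are hypothetical. It keeps the outer loop over combinations but replaces the inner scan over all sentences with a few integer operations.

```python
from collections import defaultdict

def count_with_bitmasks(sentences, combinations):
    """Map each combination string to the number of sentences containing all its words."""
    # word -> integer bitmask; bit i is set when sentence i contains the word
    word_mask = defaultdict(int)
    for i, sentence in enumerate(sentences):
        for word in sentence.split(','):
            word_mask[word] |= 1 << i

    # Start from "all sentences match" and narrow down word by word
    all_sentences = (1 << len(sentences)) - 1
    counts = {}
    for comb in combinations:
        mask = all_sentences
        for word in comb.split(','):
            mask &= word_mask.get(word, 0)  # unknown word -> no sentence matches
            if not mask:
                break  # no sentence can match any more
        counts[comb] = bin(mask).count('1')  # number of matching sentences
    return counts

# Tiny made-up example
example_counts = count_with_bitmasks(
    ["a,b,c", "b,c", "c,a,b,d", "d"],
    ["a,b", "c", "d,a", "z"],
)
```

The result is keyed by the original combination string rather than by `','.join(set)`, so the keys do not depend on set iteration order.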

I hope I have explained myself well enough. Thank you in advance for your time.

To create a test example:

import random
import string

# Each single character plays the role of one "word"
chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
nsentences = 18000
# Each sentence is a random, duplicate-free selection of words
sentences = [','.join(random.sample(chars, random.randint(1,len(chars)))) for _ in range(nsentences)]

ncombinations = 1000000
combinations = [','.join(random.sample(chars, random.randint(1,len(chars)))) for _ in range(ncombinations)]

setsentences = [set(sentence.split(',')) for sentence in sentences]
combinations = [set(comb.split(',')) for comb in combinations]

sentenceCount = {}
for comb in combinations:
    sentenceCount[','.join(comb)] = sum([comb.issubset(sentence) for sentence in setsentences])
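For data shaped like the test example above, the subset test can also be phrased as a matrix product: encode sentences and combinations as 0/1 rows over a shared vocabulary, multiply, and count the sentences whose number of shared words equals the size of the combination. This is a sketch assuming NumPy is available and that the sentence matrix fits in memory; `count_with_matrices` and its `chunk` parameter are hypothetical names, and whether this beats the plain loop has not been measured here.

```python
import numpy as np

def count_with_matrices(sentences, combinations, chunk=1000):
    """Return an array with, for each combination, the number of sentences containing it."""
    # Vocabulary: every distinct word seen in the sentences gets a column index
    words = sorted({w for s in sentences for w in s.split(',')})
    vocab = {w: j for j, w in enumerate(words)}

    # S[i, j] == 1 when sentence i contains word j
    S = np.zeros((len(sentences), len(vocab)), dtype=np.int32)
    for i, s in enumerate(sentences):
        for w in s.split(','):
            S[i, vocab[w]] = 1

    counts = np.zeros(len(combinations), dtype=np.int64)
    # Process combinations in chunks so the intermediate product stays small
    for start in range(0, len(combinations), chunk):
        block = combinations[start:start + chunk]
        C = np.zeros((len(block), len(vocab)), dtype=np.int32)
        sizes = np.zeros(len(block), dtype=np.int32)
        for k, comb in enumerate(block):
            comb_words = comb.split(',')
            sizes[k] = len(comb_words)
            for w in comb_words:
                if w in vocab:  # a word absent from every sentence can never match
                    C[k, vocab[w]] = 1
        # shared[k, i] = number of words of combination k present in sentence i
        shared = C @ S.T
        # Sentence i contains combination k exactly when all of its words are shared
        counts[start:start + len(block)] = (shared == sizes[:, None]).sum(axis=1)
    return counts

# Tiny made-up example
example = count_with_matrices(["a,b,c", "b,c", "c,a,b,d", "d"], ["a,b", "c", "d,a", "z"])
```

Each chunk allocates an intermediate array of `chunk` by the number of sentences, so `chunk` trades memory for fewer Python-level iterations.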

