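Suppose one column of a dataframe contains sentences (words separated by commas), and I also have combinations of words (again separated by commas). Words are never repeated within a sentence or within a combination. I need to count, for each combination of words, the number of sentences in which it appears, regardless of word order.
I have a pandas dataframe column that looks like this:
df.sentences =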
0 GO:0002576,GO:0008150,GO:0043312
1 GO:0001869,GO:0002576,GO:0007597,GO:0010466,GO...
2 GO:0006400,GO:0006412,GO:0006418,GO:0006419,GO...
3 GO:0007416,GO:0030036,GO:0030097,GO:0032092,GO...
4 GO:0002407,GO:0006816,GO:0006874,GO:0006887,GO...
...
14503 GO:0002221,GO:0002223,GO:0002376,GO:0045087
14504 GO:0003351,GO:0048240
14505 GO:0001889,GO:0006351,GO:0006355,GO:0006357,GO...
14506 GO:0006892,GO:0007596,GO:0008089,GO:0016081,GO...
14507 GO:0000209,GO:0007030,GO:0008283,GO:0016567,GO...
Name: annots, Length: 14508, dtype: object
Then there is a list of the distinct combinations of strings derived from the column above: combinations =
0 GO:0007165
1 GO:0007186
2 GO:0007155
3 GO:0006954
4 GO:0019221
...
16778101 GO:0000165,GO:0000209,GO:0002223,GO:0006521,GO...
16778102 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778103 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778104 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778105 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
Name: itemsets, Length: 16778106, dtype: object
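What I have right now:
# Turn every sentence and every combination into a set of words
setsentences = [set(sentence.split(',')) for sentence in df.sentences]
combinations = [set(comb.split(',')) for comb in combinations]
sentenceCount = {}
for comb in combinations:
    # A combination matches a sentence when all of its words are in that sentence
    sentenceCount[','.join(comb)] = sum([comb.issubset(sentence) for sentence in setsentences])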
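The problem here (IMO) is a loop of 16,778,106 iterations, each of which scans all 14,508 sentences. Is there a way to use apply or map to count the sentences quickly? Maybe by converting the words to numbers, or by using regular expressions?
I hope I have explained myself well enough. Thanks in advance for your time.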
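One direction that may help, shown here only as a minimal sketch and not as a tested answer for the real data: build an inverted index that maps each word to the positions of the sentences containing it, then intersect those position sets for the words of a combination. The helper names build_index and count_sentences and the tiny sample data are assumptions made for illustration.

```python
def build_index(sentences):
    """Map each word to the set of sentence positions that contain it."""
    index = {}
    for pos, sentence in enumerate(sentences):
        for word in sentence.split(','):
            index.setdefault(word, set()).add(pos)
    return index


def count_sentences(combination, index):
    """Count the sentences that contain every word of a comma-separated combination."""
    # Sort the position sets by size so the intersection starts from the rarest word.
    postings = sorted((index.get(word, set()) for word in combination.split(',')), key=len)
    if not postings[0]:
        return 0  # at least one word never occurs in any sentence
    return len(set.intersection(*postings))


# Hypothetical sample data standing in for df.sentences and the combinations.
sample_sentences = ["a,b,c", "b,c", "c,d", "a,c"]
sample_index = build_index(sample_sentences)
sample_counts = {comb: count_sentences(comb, sample_index) for comb in ["c,a", "b,c", "d,a", "z"]}
```

With this layout each combination only touches the sentences that contain its rarest word instead of scanning all of them, so it should do much less work per combination when the words are sparse, as GO terms probably are. The outer loop over the combinations is still there.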
To create a test example:
import random
import string
chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
nsentences = 18000
sentences = [','.join(random.sample(chars, random.randint(1,len(chars)))) for _ in range(nsentences)]
ncombinations = 1000000
combinations = [','.join(random.sample(chars, random.randint(1,len(chars)))) for _ in range(ncombinations)]
setsentences = [set(sentence.split(',')) for sentence in sentences]
combinations = [set(comb.split(',')) for comb in combinations]
sentenceCount = {}
for comb in combinations:
    sentenceCount[','.join(comb)] = sum([comb.issubset(sentence) for sentence in setsentences])
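The "convert words to numbers" idea could be sketched as follows. This is an assumption-laden illustration, not a benchmarked solution: each word gets one Python integer used as a bitset in which bit i is set when sentence i contains the word. A combination is then counted by AND-ing the bitsets of its words and counting the remaining bits. The function names sentence_bitsets and count_with_bitsets are hypothetical, and the data is a scaled-down version of the test example above so that the brute-force check finishes quickly.

```python
import random
import string


def sentence_bitsets(sentences):
    """For each word, build an int whose bit i is set when sentence i contains the word."""
    bitsets = {}
    for i, sentence in enumerate(sentences):
        for word in sentence.split(','):
            bitsets[word] = bitsets.get(word, 0) | (1 << i)
    return bitsets


def count_with_bitsets(combination, bitsets, n_sentences):
    """AND the bitsets of all words in the combination and count the surviving bits."""
    mask = (1 << n_sentences) - 1  # every sentence is a candidate at first
    for word in combination.split(','):
        mask &= bitsets.get(word, 0)
        if mask == 0:
            break  # no sentence contains all words seen so far
    return bin(mask).count('1')


# Scaled-down test data (sizes are arbitrary assumptions).
random.seed(0)
chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
sentences = [','.join(random.sample(chars, random.randint(1, len(chars)))) for _ in range(200)]
combinations = [','.join(random.sample(chars, random.randint(1, 5))) for _ in range(50)]

bitsets = sentence_bitsets(sentences)
fast = {frozenset(c.split(',')): count_with_bitsets(c, bitsets, len(sentences)) for c in combinations}

# Brute-force reference, equivalent to the loop above, used to check the result.
setsentences = [set(s.split(',')) for s in sentences]
slow = {frozenset(c.split(',')): sum(frozenset(c.split(',')).issubset(s) for s in setsentences)
        for c in combinations}
```

The results are keyed by frozenset here because ','.join(comb) on a set depends on the set's iteration order, so the same combination could in principle end up under different string keys.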