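I have combined the tokens from a number of files (batch 1) into a master word frequency list, and I am now comparing that list against a series of other files (batch 2). Initially I produced a binary output: "1" if a word appears in both the master word list and a given file in batch 2, and "0" if it does not. For example: [1,0,1,1]
Now I would like it to output the frequency of each word that appears instead, i.e. if "cat" occurs 9 times in the master word frequency list and is also present in file 1 of batch 2, it should output "9" rather than "1". For example: [9,0,21,42]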
    # globalFreqSets generates a dictionary like output: ('to', 634), ('be', 604), ('and', 594)
    # finalValues generates just the number element of globalFreqSets: [634, 604, 594]
    output = []
    for text in doc_text:
        binarySim = []
        # creates loop to indirectly navigate through "globalFreqSets".
        # only the first item needs to be retrieved ('patient') hence the second item is set to [0] .
        for j in range(len(globalFreqSets)):
            master_wordlist = globalFreqSets[j][0]
            i = 0
            # looping through words in list "text"
            for sub_wordlist in text:
                i += 1
                # adds 1 to "binarySim" when target word in master_wordlist is present in the sub_wordlist
                if master_wordlist == sub_wordlist:
                    binarySim.append(1)
                    # breaks when a match is found to avoid multiple entries per word
                    break
                # adds 0 to "binarySim" when target word in master_wordlist is not present in the sub_wordlist
                elif i == len(text):
                    binarySim.append(0)
        # adding "binarySim" to "output"
        output.extend([binarySim])
Sorry if this is the wrong format or wording, I'm still fairly new to coding :)
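For reference, here is a minimal sketch of how the frequency output described above might be produced. It assumes `globalFreqSets` is a list of `(word, count)` tuples and `doc_text` is a list of token lists, as the comments in the code suggest; the sample data and the helper name `frequency_vectors` are hypothetical and only there for illustration.

```python
# Hypothetical sample data standing in for the variables in the question.
globalFreqSets = [("to", 634), ("be", 604), ("and", 594), ("cat", 9)]
doc_text = [
    ["to", "be", "or", "not", "to", "be"],
    ["the", "cat", "and", "the", "hat"],
]


def frequency_vectors(freq_pairs, docs):
    """Return one list per document: the master-list count of each master
    word that occurs in the document, and 0 for each word that does not."""
    vectors = []
    for words in docs:
        # A set gives fast membership tests, so the document is not
        # rescanned once for every master word.
        present = set(words)
        vectors.append(
            [count if word in present else 0 for word, count in freq_pairs]
        )
    return vectors


output = frequency_vectors(globalFreqSets, doc_text)
print(output)  # [[634, 604, 0, 0], [0, 0, 594, 9]]
```

Note that this differs from the loop in the question in one edge case: an empty document yields a row of zeros here, whereas the original loop appends nothing for it. Whether that is the desired behaviour is an assumption worth checking against the real data.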