How to count words in a corpus

.... full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>

How would I calculate how frequently the words "students", "trust", and "ayre" occur in full?

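3 Answers

User

#1 · Edited 2024-09-28 23:22:37

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just increment its value by one: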

from collections import defaultdict

total = 0
count = defaultdict(int)  # missing words start at 0
for word in words:  # words: any iterable of tokens, e.g. full
    total += 1
    count[word] += 1

# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * ct / total))
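To pull out only the words from the question, something like the following sketch may help. The token list and the helper name relative_frequency are made up for illustration; in the original setting the tokens would come from full. One detail worth knowing: indexing a defaultdict with a word it has not seen inserts that word, so the lookups here go through .get instead.

```python
from collections import defaultdict

# Hypothetical token list standing in for the corpus words (e.g. full).
words = ["students", "trust", "students", "ayre", "report", "students"]

count = defaultdict(int)
for word in words:
    count[word] += 1
total = len(words)


def relative_frequency(word):
    """Return the share of tokens equal to word (0.0 if it never occurs)."""
    # .get avoids inserting a new key, which count[word] would do.
    return count.get(word, 0) / total if total else 0.0


for target in ("students", "trust", "ayre", "missing"):
    print("%s: %d occurrences, %.1f%%"
          % (target, count.get(target, 0), 100.0 * relative_frequency(target)))
```

Because the lookups use .get, asking about a word that never occurs leaves the dictionary unchanged.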

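User

#2 · Edited 2024-09-28 23:22:37

I would suggest you look into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. It counted 30 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (the variable Words will in practice be a reference to a file or similar):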

from collections import Counter

my_counter = Counter()
for word in Words:  # Words: an iterable of tokens
    my_counter[word] += 1  # update(word) would count the characters of word

When finished, the words are in the dictionary my_counter, which can then be written to disk or stored elsewhere (in sqlite, for example).
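As a rough sketch of that last step, the snippet below counts tokens from a file-like object and stores the result in sqlite using only the standard library. The helper iter_words, the table name word_counts, the lowercasing, and the whitespace tokenization are all assumptions made for illustration, and an in-memory database and StringIO stand in for real files.

```python
import io
import sqlite3
from collections import Counter


def iter_words(lines):
    """Yield lowercase whitespace-separated tokens, one line at a time."""
    for line in lines:
        for token in line.split():
            yield token.lower()


# Stand-in for an open file; a real corpus would use open(path, encoding="utf-8").
text = io.StringIO("Students trust the report\nthe students read the report\n")

# Counter consumes the generator lazily, so the whole text is never held at once.
my_counter = Counter(iter_words(text))

# ":memory:" keeps the example self-contained; pass a file path to persist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)")
conn.executemany("INSERT INTO word_counts (word, n) VALUES (?, ?)", my_counter.items())
conn.commit()

row = conn.execute("SELECT n FROM word_counts WHERE word = ?", ("students",)).fetchone()
print(row[0])
```

Reading line by line keeps memory use tied to the number of distinct words rather than to the size of the text.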

User

#3 · Edited 2024-09-28 23:22:37

You're almost there! You can index the FreqDist with the words you are interested in. Try the following:

print(fdist['students'])
print(fdist['trust'])
print(fdist['ayre'])

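This gives you the count, that is, the number of occurrences of each word. You asked "how frequently", though, and frequency is different from the number of occurrences. It can be obtained like this: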

print(fdist.freq('students'))
print(fdist.freq('trust'))
print(fdist.freq('ayre'))
