Finding the frequency of each word in a text file in Python

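User

Answer 1 · edited 2024-09-30 14:15:49

I saw you were using this example and ran into the same thing you did. To make it work, you have to split the string on whitespace first. If you don't, FreqDist iterates over the string itself and counts every character, which is what you were seeing. The code below returns the correct count for each word rather than for each character.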

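import nltk

text1 = 'hello he heloo hello hi '
# split() with no argument splits on any whitespace and drops empty strings,
# so the trailing space does not produce an empty '' token
text1 = text1.split()
fdist1 = nltk.FreqDist(text1)
print(fdist1.most_common(50))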
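To see the difference between counting characters and counting words without installing NLTK, here is a minimal sketch using collections.Counter from the standard library. It assumes FreqDist behaves like Counter on this point, since FreqDist is a Counter subclass; the variable names are only illustrative.

```python
from collections import Counter

text = 'hello he heloo hello hi '

# Iterating over a string yields single characters,
# so this counts letters and spaces, not words.
char_counts = Counter(text)

# split() with no argument splits on runs of whitespace and
# drops empty strings, so this counts words.
word_counts = Counter(text.split())

print(char_counts.most_common(3))
print(word_counts.most_common(3))
```

The first line of output lists single characters, while the second lists whole words, which is the behavior the split is meant to restore.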

To read a file and get the word counts, you can do this:

input.txt

hello he heloo hello hi
my username is heinst
your username is frooty

Python code

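import nltk

# Read the whole file; split() below treats newlines as whitespace too
with open("input.txt", "r") as myfile:
    data = myfile.read()

# split() with no argument avoids empty '' tokens from trailing newlines
data = data.split()
fdist1 = nltk.FreqDist(data)
print(fdist1.most_common(50))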
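The same file-reading approach can be wrapped in a small function. This is a sketch only: count_words is a hypothetical helper name, it uses collections.Counter in place of nltk.FreqDist, and it writes the sample input.txt contents to a temporary directory so the example runs on its own.

```python
from collections import Counter
from pathlib import Path
import tempfile


def count_words(path):
    """Return a Counter of the whitespace-separated words in the file at path."""
    text = Path(path).read_text(encoding="utf-8")
    # split() treats spaces, tabs, and newlines alike
    return Counter(text.split())


# Demo data: the same three lines as the sample input.txt above
sample = (
    "hello he heloo hello hi\n"
    "my username is heinst\n"
    "your username is frooty\n"
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "input.txt"
    demo_path.write_text(sample, encoding="utf-8")
    counts = count_words(demo_path)

print(counts.most_common(3))
```

For a real file, call count_words("input.txt") directly and skip the temporary-directory part.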

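User

Answer 2 · edited 2024-09-30 14:15:49

text1 in the NLTK book is a collection of tokens (words and punctuation marks), unlike text1 in your code example, which is a string (a collection of Unicode code points):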

>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
          'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

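If your input really is whitespace-separated words, use @Boa's answer to find the frequencies:

from collections import Counter
freq = Counter(text_with_space_separated_words.split())

Note: FreqDist is a Counter, but it also defines additional methods such as .plot().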

If you want to use the NLTK tokenizers instead:

#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk

with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})

sent_tokenize() splits the text into sentences. word_tokenize() then splits each sentence into words. There are many ways to tokenize text in NLTK.
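To illustrate what tokenizing buys you over a plain whitespace split, here is a rough sketch that uses a regular expression from the standard library. It is not NLTK's tokenizer and will disagree with word_tokenize on contractions, abbreviations, and similar cases; simple_tokenize is a hypothetical helper name.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    # \w+ matches a run of word characters;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hello, hello! Hi there."
# casefold() makes the count case-insensitive, as in the NLTK example above
freq = Counter(token.casefold() for token in simple_tokenize(text))

print(freq.most_common(2))
```

With a plain split(), "Hello," and "hello!" would be counted as two different words; separating punctuation into its own tokens and casefolding lets both count as "hello".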

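User

Answer 3 · edited 2024-09-30 14:15:49

For what it's worth, NLTK seems like overkill for this task. The following gives you the word frequencies; calling word_freqs.most_common() lists them from highest to lowest.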

from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
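A Counter does not sort its entries by count on its own; the ordering comes from most_common(). Below is a minimal sketch with a hard-coded sample string standing in for the file contents (the sample text is an assumption for illustration).

```python
from collections import Counter

# Sample text standing in for the contents of a file
input_string = "hello he heloo hello hi"
word_freqs = Counter(input_string.split())

# most_common() returns (word, count) pairs sorted by count, highest first
ranked = word_freqs.most_common()

for word, count in ranked:
    print(f"{word}\t{count}")
```

Passing a number, as in word_freqs.most_common(10), limits the result to that many of the most frequent words.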
