Efficiently counting word frequencies in Python

User

Answer #1 · Posted 2024-05-20 00:00:55

The most succinct approach is to use the tools Python gives you.

from future_builtins import map  # Python 2 only; delete this line on Python 3

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

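That's it. map(str.split, f) makes a generator that returns a list of words for each line. Wrapping it in chain.from_iterable converts that into a single generator that produces one word at a time. Counter takes an input iterable and counts all the unique values in it. At the end, you return a dict-like object (a Counter) that stores every unique word and its count. During creation, only one line of data and the running totals are held in memory, not the whole file at once.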
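To make the three stages concrete, here is a small sketch that runs the same pipeline on an in-memory list of lines instead of a file. The sample lines and the helper name count_words are made up for illustration, and Python 3 is assumed (where map is already lazy).

```python
from collections import Counter
from itertools import chain

def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    # map(str.split, lines) lazily yields one list of words per line
    per_line = map(str.split, lines)
    # chain.from_iterable flattens those lists into a single stream of words
    words = chain.from_iterable(per_line)
    # Counter consumes the stream and tallies each distinct word
    return Counter(words)

# Hypothetical sample data standing in for the lines of a file
sample = ["the quick brown fox\n", "the lazy dog\n", "\n"]
counts = count_words(sample)
print(counts["the"])      # 2
print(counts["missing"])  # 0, because a Counter returns 0 for unseen keys
```

Because an open file object is itself an iterable of lines, passing one to count_words should behave the same way as the sample list.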

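In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and counting with a dict or collections.defaultdict(int), because Counter is implemented in Python there, which can make it slower in some cases. Letting Counter do the work is simpler and more self-documenting, though: the whole goal is counting, so use a Counter. Beyond that, on CPython (the reference interpreter) 3.2 and later, Counter has a C-level accelerator for counting iterable inputs that runs faster than anything you could write in pure Python.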
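For reference, the manual alternative mentioned above might look roughly like the following. This is only a sketch: the function name count_words_defaultdict and the sample lines are hypothetical, and it is written for Python 3.

```python
from collections import defaultdict
from itertools import chain

def count_words_defaultdict(lines):
    """Manual counting loop over the chained words, without Counter."""
    counts = defaultdict(int)
    for word in chain.from_iterable(map(str.split, lines)):
        counts[word] += 1  # a missing key starts at int() == 0
    return counts

# Hypothetical sample data standing in for the lines of a file
sample = ["a b a\n", "b a\n"]
manual = count_words_defaultdict(sample)
print(dict(manual))  # {'a': 3, 'b': 2}
```

The result is a plain mapping of word to count, so it lacks Counter conveniences such as most_common.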

Update: You seem to want punctuation stripped and case ignored, so here is a variant of my earlier code that does that:

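from string import punctuation

# Table that deletes ASCII punctuation (Python 3 form).
# On Python 2, call line.translate(None, punctuation) instead.
strip_punct = str.maketrans('', '', punctuation)

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(strip_punct).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))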

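Your code runs much slower because it creates and destroys many small Counter and set objects, rather than calling .update on a single Counter once per line. That approach would be slightly slower than the updated code block above, but at least algorithmically similar in how it scales.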
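The question's code is not reproduced on this page, so the following is only a sketch of the single-Counter pattern described above: one Counter, updated once per line. The name count_words_update and the sample lines are hypothetical; Python 3 is assumed.

```python
from collections import Counter
from string import punctuation

# Table that deletes ASCII punctuation characters
STRIP_PUNCT = str.maketrans("", "", punctuation)

def count_words_update(lines):
    """Keep one Counter and update it once per line."""
    total = Counter()
    for line in lines:
        # Strip punctuation, lowercase, split, then add this line's words
        total.update(line.translate(STRIP_PUNCT).lower().split())
    return total

# Hypothetical sample data standing in for the lines of a file
sample = ["Hello, world!\n", "hello again.\n"]
totals = count_words_update(sample)
print(totals.most_common(1))  # [('hello', 2)]
```

No temporary Counter or set objects are built per line here; only the short list produced by split is created and discarded on each iteration.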

User

Answer #2 · Posted 2024-05-20 00:00:55

This should suffice.

def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d

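User

Answer #3 · Posted 2024-05-20 00:00:55

A memory-efficient and accurate approach is to make use of:

CountVectorizer in scikit-learn (for n-gram extraction)
NLTK for word_tokenize
numpy matrix sum to collect the counts
collections.Counter to collect the counts and vocabulary

An example: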

import urllib.request
from collections import Counter

import numpy as np 

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')


# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vocab = list(ngram_vectorizer.get_feature_names_out())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print(freq_distribution.most_common(10))

[out]:

[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:

from collections import Counter
import numpy as np 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names_out())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's time it:

import time

start = time.time()
word_distribution = freq_dist(data)
print(time.time() - start)

[out]:

5.257147789001465

Note that CountVectorizer can also take a file instead of a string, so there is no need to read the whole file into memory. In code:

import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names_out()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print(freq_distribution.most_common(10))
