An efficient way to build a word-occurrence list in Python

Posted 2024-10-05 10:01:08


I want to build a data structure that records how many times each word occurs and lists the words in the correct order by count.

For example:

word_1 => 10 occurrences

word_2 => 5 occurrences

word_3 => 12 occurrences

word_4 => 2 occurrences

Each word has an id that represents it:

kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

So the ordered list (ids sorted by descending number of occurrences) should be:

ordered_vocab = [2, 0, 1, 3]

For example, my code is…:

# build a vocabulary with the number of occurrences
vocab = {}
count = 0
for line in open(DATASET_FILE):
    for word in line.split():
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1
    count += 1
    if not count % 100000:
        print(count, "documents processed")

How can I do this efficiently?


3 Answers

This is what collections.Counter is for:

from collections import Counter
cnt = Counter()

with open(DATASET_FILE) as fp:
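    # Iterate over the file object directly so lines are read lazily,
    # instead of loading the whole file with readlines().
    for line in fp: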
        for word in line.split():
            cnt[word] += 1

Or (shorter and "prettier", using a generator expression):

from collections import Counter

with open(DATASET_FILE) as fp:
    # Iterating over fp directly keeps the generator lazy; readlines() would load the whole file first.
    words = (word for line in fp for word in line.split())
    cnt = Counter(words)
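
The snippets above only produce the counts. As a minimal sketch of the remaining step (not part of the original answer), the ordered id list from the question can be derived with Counter.most_common(). The counts and the kw2id mapping below are copied from the question's example rather than read from a file.

```python
from collections import Counter

# Counts taken from the example in the question (normally built from the file).
cnt = Counter({"word_1": 10, "word_2": 5, "word_3": 12, "word_4": 2})

# Word -> id mapping from the question.
kw2id = {"word_1": 0, "word_2": 1, "word_3": 2, "word_4": 3}

# most_common() returns (word, count) pairs sorted by descending count.
ordered_vocab = [kw2id[word] for word, _ in cnt.most_common()]
print(ordered_vocab)  # [2, 0, 1, 3]
```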

You can use collections.Counter. Counter lets you pass in a list, and it automatically counts the occurrences of each element.

from collections import Counter
l = [1,2,2,3,3,3]
cnt = Counter(l)

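So, in addition to the answer above, you can build a list of words from the file and pass that list to Counter instead of manually iterating over every element. Note that this approach is not suitable if the file is too large to fit in memory.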
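
If memory is a concern, one possible workaround (a sketch, not something from the original answer) is to feed Counter one line at a time with Counter.update(), so only the current line and the running totals are held in memory. Here io.StringIO is an illustrative stand-in for open(DATASET_FILE).

```python
import io
from collections import Counter

# Stand-in for open(DATASET_FILE); any iterable of lines behaves the same way.
fake_file = io.StringIO("a b a\nb c\na\n")

cnt = Counter()
for line in fake_file:
    # update() adds the counts of this line's words to the running totals.
    cnt.update(line.split())

print(cnt["a"], cnt["b"], cnt["c"])  # 3 2 1
```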

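Here is a slightly faster version of your code. Sorry, I don't know numpy very well, but maybe this will help. enumerate and defaultdict(int) are the changes I made (you don't have to accept this answer, I'm just trying to help).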

from collections import defaultdict

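# build a vocabulary with the number of occurrences
vocab = defaultdict(int)
with open(DATASET_FILE) as file_handle:
    # Start counting at 1 so the progress message matches the original code
    # and is not printed for the very first line.
    for count, line in enumerate(file_handle, 1):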
        for word in line.split():
            vocab[word] += 1
        if not count % 100000:
            print(count, "documents processed")
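
To get from these counts to the ordered id list the question asks for, something along the following lines might work. This is a sketch under assumptions: build_ordered_vocab is a hypothetical helper name, and ids are assigned in order of first appearance, which the question does not specify.

```python
from collections import defaultdict

def build_ordered_vocab(lines):
    """Return (kw2id, ordered_vocab) for an iterable of text lines."""
    vocab = defaultdict(int)
    kw2id = {}
    for line in lines:
        for word in line.split():
            vocab[word] += 1
            # Assumption: ids are handed out in order of first appearance.
            kw2id.setdefault(word, len(kw2id))
    # Sort words by descending count; the sort is stable, so ties keep first-seen order.
    ordered_words = sorted(vocab, key=vocab.get, reverse=True)
    return kw2id, [kw2id[word] for word in ordered_words]

kw2id, ordered_vocab = build_ordered_vocab(["b a a", "c a b"])
print(kw2id)          # {'b': 0, 'a': 1, 'c': 2}
print(ordered_vocab)  # [1, 0, 2]
```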

Also, defaultdict(int) appears to be about twice as fast as Counter() when incrementing from 0 in a for loop (running Python 3.44):

from collections import Counter
from collections import defaultdict
import time

words = " ".join(["word_"+str(x) for x in range(100)])
lines = [words for i in range(100000)]

counter_dict = Counter()
default_dict = defaultdict(int)

# Time the increments on a Counter
start = time.time()
for line in lines:
    for word in line.split():
        counter_dict[word] += 1
end = time.time()
print(end - start)

# Time the same increments on a defaultdict(int)
start = time.time()
for line in lines:
    for word in line.split():
        default_dict[word] += 1
end = time.time()
print(end - start)

Results (seconds, Counter first, then defaultdict):

5.353034019470215
2.554084062576294

If you want to dispute this claim, I refer you to this question: Surprising results with Python timeit: Counter() vs defaultdict() vs dict()
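
For anyone who wants to re-check the comparison on their own Python version, a rough sketch using the standard timeit module could look like this. The helper count_with and the much smaller input size are illustrative choices, not part of the original benchmark, and the timings will vary by machine and interpreter.

```python
import timeit
from collections import Counter, defaultdict

words = " ".join("word_" + str(x) for x in range(100))
lines = [words] * 1000  # far smaller than the original 100000 lines

def count_with(factory):
    """Count words into a fresh mapping created by factory()."""
    counts = factory()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

# timeit runs each callable `number` times and returns the total elapsed seconds.
counter_time = timeit.timeit(lambda: count_with(Counter), number=5)
default_time = timeit.timeit(lambda: count_with(lambda: defaultdict(int)), number=5)

print("Counter:", counter_time)
print("defaultdict(int):", default_time)
```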
