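<p>Here is a slightly faster version of your code. Sorry, I don't know numpy well, but maybe this helps: the changes I made are <code>enumerate</code> and <code>defaultdict(int)</code> (you don't have to accept this answer, I'm just trying to help).</p>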
<pre><code>from collections import defaultdict
# build a vocabulary with the number of occurrences of each word
vocab = defaultdict(int)
with open(DATASET_FILE) as file_handle:
    # enumerate supplies the line counter, so no manual increment is needed
    for count, line in enumerate(file_handle):
        for word in line.split():
            vocab[word] += 1  # missing keys start at 0
        if not count % 100000:
            print(count, "documents processed")
</code></pre>
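<p>In case it helps to try the loop without the dataset, here is a self-contained sketch of the same counting logic. <code>count_words</code> is a hypothetical helper name and the <code>io.StringIO</code> sample stands in for the real file; both are assumptions for illustration, not part of the original code.</p>

```python
from collections import defaultdict
import io


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    vocab = defaultdict(int)
    for line in lines:
        for word in line.split():
            vocab[word] += 1
    return vocab


# Stand-in for the real dataset file (assumption for illustration only).
sample = io.StringIO("a b a\nb c\n")
vocab = count_words(sample)
print(dict(vocab))
```

<p>Because the helper accepts any iterable of lines, an open file handle can be passed in the same way as the sample.</p>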
<p>Also, for incrementing inside a for loop (running Python 3.44), <code>defaultdict(int)</code> starting from 0 appears to be about twice as fast as <code>Counter()</code>:</p>
<pre><code>from collections import Counter
from collections import defaultdict
import time
# 100,000 identical lines of 100 distinct words each
words = " ".join(["word_" + str(x) for x in range(100)])
lines = [words for i in range(100000)]
counter_dict = Counter()
default_dict = defaultdict(int)

# time the increments on a Counter
start = time.time()
for line in lines:
    for word in line.split():
        counter_dict[word] += 1
end = time.time()
print(end - start)

# time the same increments on a defaultdict(int)
start = time.time()
for line in lines:
    for word in line.split():
        default_dict[word] += 1
end = time.time()
print(end - start)
</code></pre>
<p>Results (in seconds, <code>Counter()</code> first, then <code>defaultdict(int)</code>):</p>
<pre><code>5.353034019470215
2.554084062576294
</code></pre>
<p>If you want to dispute this claim, I refer you to this question: <a href="https://stackoverflow.com/questions/27801945/surprising-results-with-python-timeit-counter-vs-defaultdict-vs-dict">Surprising results with Python timeit: Counter() vs defaultdict() vs dict()</a></p>
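<p>A single <code>time.time()</code> delta can be noisy, so here is a rough sketch of how the comparison above could be repeated with <code>timeit.repeat</code>. The <code>fill</code> helper and the smaller workload (1,000 lines instead of 100,000) are my own assumptions to keep it quick, and the ratio you see may well differ on other Python versions.</p>

```python
from collections import Counter, defaultdict
import timeit

# Smaller workload than the benchmark above so the sketch finishes quickly.
words = " ".join("word_" + str(x) for x in range(100))
lines = [words] * 1000


def fill(mapping):
    """Increment one entry per word, like the loops in the benchmark above."""
    for line in lines:
        for word in line.split():
            mapping[word] += 1
    return mapping


# Take the best of several runs to reduce noise from other processes.
counter_time = min(timeit.repeat(lambda: fill(Counter()), number=1, repeat=5))
default_time = min(timeit.repeat(lambda: fill(defaultdict(int)), number=1, repeat=5))
print("Counter:         ", counter_time)
print("defaultdict(int):", default_time)
```

<p>Each run gets a fresh, empty mapping, so both containers start from zero as in the original test.</p>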