<p>Function calls and global namespace lookups are expensive.</p>
<p>Your <code>stupid</code> function makes 2 function calls per element of the word list. The second one, in your <code>max</code> call, is entirely avoidable: iterating over the dict's keys and then looking up the value for each key with <code>dict.get</code> is a glaring inefficiency when you could iterate over key-value pairs instead.</p>
<pre><code>def stupid(words):
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return max(freqs, key=freqs.get)

def most_frequent(words):
    ## Build the frequency dict
    freqs = {}
    for w in words:
        if w in freqs:
            freqs[w] += 1
        else:
            freqs[w] = 1
    ## Search the frequency dict
    m_k = None
    m_v = 0
    for k, v in freqs.iteritems():
        if v > m_v:
            m_k, m_v = k, v
    return m_k, m_v
</code></pre>
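<p>The code above is Python 2 (<code>iteritems</code>). As a sketch of the key-value-pair iteration recommended above, in Python 3 the search step collapses to a single <code>max</code> over <code>freqs.items()</code>, with no per-key <code>dict.get</code> lookup:</p>
<pre><code>def most_frequent(words):
    # Build the frequency dict in one pass.
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    # Iterate (word, count) pairs directly instead of keys + dict.get.
    return max(freqs.items(), key=lambda kv: kv[1])

print(most_frequent(["a", "b", "a", "c", "a", "b"]))  # -> ('a', 3)
</code></pre>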
<hr/>
<p>Using user1952500's single-pass suggestion, how does this perform on your large sample set?</p>
<pre><code>def most_frequent(words):
    """Noting max as we go"""
    freq = {}
    m_k = None
    m_v = 0
    for w in words:
        if w in freq:
            v = freq[w] + 1
        else:
            v = 1
        freq[w] = v
        if v > m_v:
            m_k = w
            m_v = v
    return m_k, m_v
</code></pre>
<p>This has the small advantage of being stable when there are multiple most-frequent values.</p>
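<p>To illustrate the stability point, a minimal sketch (Python 3, with a hypothetical helper modeled on the single-pass suggestion): because the running maximum only updates on a strict <code>&gt;</code>, the word that reaches the top count first wins any tie, deterministically:</p>
<pre><code>def single_pass_max(words):
    # Track the running maximum while counting.
    freq = {}
    m_k, m_v = None, 0
    for w in words:
        v = freq.get(w, 0) + 1
        freq[w] = v
        if v > m_v:  # strict '>' means a later tie cannot displace m_k
            m_k, m_v = w, v
    return m_k, m_v

print(single_pass_max(["x", "y", "x", "y"]))  # -> ('x', 2)
</code></pre>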
<hr/>
<p>Timing all of the suggestions on a generated sample set:</p>
<pre><code>def word_frequency_version1(words):
    """Petar's initial"""
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return max(freqs, key=freqs.get)

def word_frequency_version2(words):
    """Matt's initial"""
    ## Build the frequency dict
    freqs = {}
    for w in words:
        if w in freqs:
            freqs[w] += 1
        else:
            freqs[w] = 1
    ## Search the frequency dict
    m_k = None
    m_v = 0
    for k, v in freqs.iteritems():
        if v > m_v:
            m_k, m_v = k, v
    return m_k, m_v

def word_frequency_version3(words):
    """Noting max as we go"""
    freq = {}
    m_k = None
    m_v = 0
    for w in words:
        if w in freq:
            v = freq[w] + 1
        else:
            v = 1
        freq[w] = v
        if v > m_v:
            m_k = w
            m_v = v
    return m_k, m_v

from collections import Counter

def word_frequency_version4(words):
    """Built-in Counter"""
    c = Counter(words)
    return c.most_common()[0]

from multiprocessing import Pool

def chunked(seq, count):
    v = len(seq) / count
    for i in range(count):
        yield seq[i*v:v+i*v]

def frequency_map(words):
    freq = {}
    for w in words:
        if w in freq:
            freq[w] += 1
        else:
            freq[w] = 1
    return freq

def frequency_reduce(results):
    freq = {}
    for result in results:
        for k, v in result.iteritems():
            if k in freq:
                freq[k] += v
            else:
                freq[k] = v
    m_k = None
    m_v = None
    for k, v in freq.iteritems():
        if v > m_v:
            m_k = k
            m_v = v
    return m_k, m_v

# def word_frequency_version5(words, chunks=5, pool_size=5):
#     pool = Pool(processes=pool_size)
#     result = frequency_reduce(pool.map(frequency_map, chunked(words, chunks)))
#     pool.close()
#     return result

def word_frequency_version5(words, chunks=5, pool=Pool(processes=5)):
    """multiprocessing Matt's initial suggestion"""
    return frequency_reduce(pool.map(frequency_map, chunked(words, chunks)))

def word_frequency_version6(words):
    """Petar's one-liner"""
    return max(set(words), key=words.count)

import timeit

freq1 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version1 as func; print func.__doc__')
freq2 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version2 as func; print func.__doc__')
freq3 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version3 as func; print func.__doc__')
freq4 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version4 as func; print func.__doc__')
freq5 = timeit.Timer('func(words,chunks=chunks)', 'from __main__ import words, word_frequency_version5 as func; print func.__doc__; chunks=10')
freq6 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version6 as func; print func.__doc__')
</code></pre>
<p>Results:</p>
<pre><code>>>> print "n={n}, m={m}".format(n=len(words),m=len(set(words)))
n=692766, m=34464
>>> freq1.timeit(10)
"Petar's initial"
3.914874792098999
>>> freq2.timeit(10)
"Matt's initial"
3.8329160213470459
>>> freq3.timeit(10)
"Noting max as we go"
4.1247420310974121
>>> freq4.timeit(10)
"Built-in Counter"
6.1084718704223633
>>> freq5.timeit(10)
"multiprocessing Matt's initial suggestion"
9.7867341041564941
</code></pre>
<p>Notes:</p>
<ul>
<li>I cheated by passing the <code>multiprocessing.Pool</code> instance as a kwarg for timing purposes, since I wanted to avoid the pool's start-up cost and <code>timeit</code> doesn't let you specify cleanup code. This was run on a "quad"-CPU VM; I'm sure that for some values of input data and CPU count, multiprocessing would be faster.</li>
<li>These mostly return the highest-frequency word; which word wins in the event of a tie for first place may be effectively random.</li>
<li>An approximation of the highest-frequency word could be faster (using sampling), but would be approximate.</li>
<li>Version 6 (the one-liner) should be disregarded for large values of <code>n*m</code>.</li>
</ul>
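<p>The sampling idea from the notes might be sketched like this (Python 3; this is an assumption about the approach, not code from the answer): count only a random subsample and scale up, trading accuracy for speed on very large, heavily skewed inputs:</p>
<pre><code>import random
from collections import Counter

def approx_most_frequent(words, sample_size=10000, seed=None):
    # Count a random sample instead of the full list; for skewed
    # distributions the true top word very likely tops the sample too.
    rng = random.Random(seed)
    if len(words) &lt;= sample_size:
        sample = words
    else:
        sample = rng.sample(words, sample_size)
    word, count = Counter(sample).most_common(1)[0]
    # Scale the sampled count back up to estimate the true frequency.
    return word, count * len(words) // len(sample)
</code></pre>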