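<p>Below is an NLTK approach that performs considerably better. I could not reproduce the same <em>sampledict</em>, so for this exercise it is built from <em>sampletext</em>.
<em>Note: the method given in the question takes roughly 60 times as long.</em></p>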
<p>Source data:</p>
<pre><code>#Invoke libraries
import nltk
import requests
import timeit
import pandas as pd
#Source sample data
payload = {'action': 'query', 'titles': 'Ceramic_capacitor', 'explaintext':1, 'prop':'extracts', 'format': 'json'}
r = requests.get('https://en.wikipedia.org/w/api.php', params=payload)
sampletext = r.json()['query']['pages']['9221221']['extract'].lower()
sampledict = sampletext.split(' ')
</code></pre>
<p>Timing the old method:</p>
<pre><code>start = timeit.default_timer()
termfreqdic = {}
for term in sampledict:
    termfreqdic[term] = sampletext.count(term)
stop = timeit.default_timer()
timetaken = stop-start
stop - start
#0.42748349941757624
</code></pre>
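<p>Part of the cost of the old method is that every <code>sampletext.count(term)</code> call rescans the whole text, and the loop does this once per entry in <em>sampledict</em>, duplicates included. Note also that <code>str.count</code> counts substring matches rather than whole tokens, so the two methods do not necessarily return the same numbers. A minimal sketch of that difference, using a made-up toy sentence (an assumption for illustration, not the Wikipedia extract):</p>

```python
# Toy text for illustration only; not the Wikipedia extract used above.
text = "a cat sat on a mat"
tokens = text.split(' ')

# str.count scans the whole string and counts substring matches,
# so the 'a' inside "cat", "sat" and "mat" is counted too.
substring_count = text.count('a')

# list.count compares whole tokens, so only the standalone 'a' matches.
token_count = tokens.count('a')

print(substring_count, token_count)  # 5 2
```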
<p>Timing the NLTK method:</p>
<pre><code>start = timeit.default_timer()
wordFreq = nltk.FreqDist(sampledict)
stop = timeit.default_timer()
timetaken = stop-start
stop - start
#0.00713308053673245
</code></pre>
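<p>In NLTK 3, <code>FreqDist</code> is a subclass of the standard library's <code>collections.Counter</code>, so the same single-pass counting can be sketched without NLTK. The token list below is a made-up example, not the data timed above:</p>

```python
from collections import Counter

# Toy token list for illustration only.
tokens = "the cat and the hat and the bat".split(' ')

# One pass over the tokens; each token is hashed and tallied once.
word_counts = Counter(tokens)

print(word_counts['the'])          # 3
print(word_counts.most_common(2))  # [('the', 3), ('and', 2)]
```

<p>The timing will differ from the <code>FreqDist</code> figure above, so it is worth measuring on your own data.</p>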
<p>Access the data by converting the frequency distribution to a DataFrame:</p>
<pre><code>wordFreqDf = pd.DataFrame(list(wordFreq.items()), columns = ["Word","Frequency"])
#Inspect data
wordFreqDf.head(10)
#output
#                     Word  Frequency
#0              60384-8/21          1
#1                 limited          2
#2                               3618
#3           comparatively          1
#4              code/month          1
#5                    four          1
#6   (microfarads):\n\nµ47          1
#7                consists          1
#8  α\n\t\t\n\t\t\n\n\n===          1
</code></pre>
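<p>The output above contains an empty-string token and tokens with embedded newlines and punctuation, because <code>split(' ')</code> only splits on single spaces. If cleaner tokens are wanted, one possible approach (an assumption on my part, not something the question asks for) is to extract alphanumeric runs with a regular expression before counting. The sample string is made up for illustration:</p>

```python
import re
from collections import Counter

# Toy text for illustration only; it mimics the newlines and punctuation
# that show up in the Wikipedia extract.
raw = "Class 1\n\nceramic capacitors: class 1, class 2."

# Splitting on single spaces keeps newlines and punctuation inside tokens.
naive_tokens = raw.lower().split(' ')

# Keep only runs of ASCII letters and digits. This is a deliberately crude
# rule: it would also break up tokens such as "60384-8/21" and drop "µ".
clean_tokens = re.findall(r"[a-z0-9]+", raw.lower())

counts = Counter(clean_tokens)
print(naive_tokens)
print(counts.most_common(2))  # [('class', 3), ('1', 2)]
```

<p>Whether this pattern is appropriate depends on the vocabulary: for part codes and unit symbols, a more permissive pattern or a proper tokenizer may be a better fit.</p>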