<h2>Data:</h2>
<pre class="lang-py prettyprint-override"><code>Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting
# read in the data
df = pd.read_clipboard(sep=',')
</code></pre>
<p><a href="https://i.stack.imgur.com/Cs2no.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/Cs2no.png" alt="enter image description here"/></a></p>
<h2>Updated code:</h2>
<ul>
<li>Convert all words to lowercase and remove all non-alphanumeric characters
<ul>
<li><code>txt = df.Subject.str.lower().str.replace(r'\|', ' ')</code> creates a <code>pandas.core.series.Series</code> and will be replaced</li>
</ul></li>
<li><code>words = nltk.tokenize.word_tokenize(txt)</code> throws a <code>TypeError</code>, because <code>txt</code> is a <code>Series</code>.
<ul>
<li>The following code tokenizes each row of the dataframe</li>
</ul></li>
<li>Tokenizing the words splits each string into a <code>list</code>. In this example, looking at <code>df</code> will show a <code>tok</code> column, where each row is a list</li>
</ul>
<pre class="lang-py prettyprint-override"><code>import nltk
import pandas as pd
top_N = 50
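# replace all non-alphanumeric characters
# (raw string plus regex=True, since newer pandas treats the pattern as a literal by default)
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)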
# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)
</code></pre>
<p><a href="https://i.stack.imgur.com/lJBnW.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/lJBnW.png" alt="enter image description here"/></a></p>
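<p>The <code>TypeError</code> mentioned above comes from handing a whole collection to a function that expects a single string. A minimal, dependency-free sketch of the same situation follows. It assumes a hypothetical <code>tokenize</code> helper built on <code>re.findall</code> as a stand-in for <code>nltk.tokenize.word_tokenize</code>, and a plain list in place of the <code>Series</code>; it is not the NLTK tokenizer itself.</p>

```python
import re

# Two of the sample subjects from the data above
subjects = [
    "Call Out: Quadria Capital - May Lo, VP",
    "Prospecting",
]


def tokenize(text):
    """Hypothetical stand-in for word_tokenize: return runs of word characters."""
    return re.findall(r"\w+", text)


# Passing the whole collection fails: the regex engine expects one string
try:
    tokenize(subjects)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Applying the tokenizer to each element works, which is what
# df.sub_rep.apply(...) does row by row
tokens = [tokenize(s.lower()) for s in subjects]
```

<p>Here <code>tokens</code> is a list of lists, one per subject, which has the same shape as the <code>tok</code> column.</p>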
<ul>
<li>To analyze all the words in the column, the individual row lists are combined into a single list, called <code>words</code>.</li>
</ul>
<pre class="lang-py prettyprint-override"><code># all tokenized words to a list
words = df.tok.tolist() # this is a list of lists
words = [word for list_ in words for word in list_]
# frequency distribution
word_dist = nltk.FreqDist(words)
# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
# output the results
# (word_dist still includes stopwords; use words_except_stop_dist here to exclude them)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
</code></pre>
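<p>Note that <code>rslt</code> is built from <code>word_dist</code>, so stopwords are still counted. The sketch below illustrates the difference between the two distributions using only the standard library. It assumes <code>collections.Counter</code> as a stand-in for <code>nltk.FreqDist</code>, hypothetical token rows shaped like the <code>tok</code> column, and a tiny illustrative stopword set rather than NLTK's full English list.</p>

```python
from collections import Counter
from itertools import chain

# Hypothetical per-row token lists, shaped like the 'tok' column
rows = [
    ["call", "out", "quadria", "capital", "may", "lo", "vp"],
    ["call", "out", "revelstoke", "anthony", "hayes", "sr", "assoc"],
    ["prospecting"],
]

# Tiny illustrative stopword set (an assumption; NLTK's list is much longer)
stopwords = {"out", "not", "will"}

# Flatten the list of lists into one list of words
words = list(chain.from_iterable(rows))

# Frequency counts with and without stopwords
word_counts = Counter(words)
filtered_counts = Counter(w for w in words if w not in stopwords)

top_all = word_counts.most_common(2)
top_filtered = filtered_counts.most_common(1)
```

<p>With this sample, <code>out</code> is counted in <code>word_counts</code> but is absent from <code>filtered_counts</code>.</p>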
<h3>Output of <code>rslt</code>:</h3>
<pre><code>Word         Frequency
call         2
out          2
quadria      1
capital      1
may          1
lo           1
vp           1
revelstoke   1
anthony      1
hayes        1
sr           1
assoc        1
columbia     1
partners     1
ww           1
worked       1
not          1
sure         1
will         1
ev           1
meeting      1
sophie       1
cfo          1
cdc          1
investment   1
prospecting  1
</code></pre>