<p>Basically, you want something that filters out the words that appear in too many of the documents in a given set?
Just use CountVectorizer from <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" rel="nofollow noreferrer">sklearn</a> with the desired cut-off. This is done with the <strong>max_df</strong> parameter. According to the documentation (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" rel="nofollow noreferrer">CountVectorizer Documentation</a>), <code>max_df</code> does the following:</p>
<p><strong>When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)</strong>.</p>
<p>That way, you can ignore words above a certain document frequency. So just apply that cut-off to eliminate the words that exceed the limit you want.</p>
<p>For example:</p>
<pre><code>from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

data = ["Amet urna tincidunt efficitur - The Guardian",
        "Yltricies hendrerit eu a nisi - The Guardian",
        "Faucibus pharetra id quis arck - The Guardian",
        "Net tristique facilisis | New York Times",
        "Quis finibus lacinia | New York Times"]

# Drop terms whose document frequency is strictly higher than 30% of the corpus
vectorizer = CountVectorizer(max_df=0.3, lowercase=False, strip_accents=None)
X = vectorizer.fit_transform(data)
vocab = vectorizer.vocabulary_

new_data = []
for text in data:
    tokens = word_tokenize(text)
    # Keep only the tokens that survived the max_df cut-off
    new_text = [tok for tok in tokens if tok in vocab]
    new_data.append(" ".join(new_text))
</code></pre>
<p>Result:</p>
<pre><code>>>> new_data
['Amet urna tincidunt efficitur',
'Yltricies hendrerit eu nisi',
'Faucibus pharetra id quis arck',
'Net tristique facilisis',
'Quis finibus lacinia']
</code></pre>
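<p>If you want to check exactly which terms the cut-off removed, the fitted vectorizer exposes them through its <code>stop_words_</code> attribute (the set of terms that were ignored because of <code>max_df</code>, <code>min_df</code>, or <code>max_features</code>). A minimal sketch using the same corpus:</p>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer

data = ["Amet urna tincidunt efficitur - The Guardian",
        "Yltricies hendrerit eu a nisi - The Guardian",
        "Faucibus pharetra id quis arck - The Guardian",
        "Net tristique facilisis | New York Times",
        "Quis finibus lacinia | New York Times"]

vectorizer = CountVectorizer(max_df=0.3, lowercase=False, strip_accents=None)
vectorizer.fit(data)

# Terms dropped because their document frequency exceeded max_df
print(sorted(vectorizer.stop_words_))
# ['Guardian', 'New', 'The', 'Times', 'York']
</code></pre>
<p>This is a quick sanity check that the threshold catches the source labels and nothing else before you run the filtering loop.</p>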