<p>You can use TF-IDF over your corpus with scikit-learn's <code>TfidfVectorizer</code>:</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['this is me', 'this was not that you thought', 'lets test them'] ## a list of documents
vec = TfidfVectorizer()
vec.fit(docs) ## fit on your documents
print(vec.vocabulary_) # print the vocabulary; don't run this for 2.5 million documents
</code></pre>
<p>Output: the vocabulary, a dict that maps each word to a unique index.</p>
<pre><code>print(vec.idf_)
</code></pre>
<p>Output: the idf value of each vocabulary word:</p>
<pre><code>[ 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.28768207 1.69314718 1.69314718 1.69314718]
</code></pre>
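<p>These values follow scikit-learn's default smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. A quick sketch reproducing the two distinct values above (assuming the three example documents):</p>

```python
import numpy as np

n = 3        # number of documents
df_once = 1  # every word except 'this' appears in exactly one document
df_this = 2  # 'this' appears in two documents

# sklearn default (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df)) + 1
idf_once = np.log((1 + n) / (1 + df_once)) + 1  # ≈ 1.69314718
idf_this = np.log((1 + n) / (1 + df_this)) + 1  # ≈ 1.28768207

print(idf_once, idf_this)
```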
<p>Now, coming back to your question: to look up the idf of a particular word, use its vocabulary index:</p>
<pre><code>word = 'thought' #example
index = vec.vocabulary_[word]
>8
print(vec.idf_[index]) #prints idf value
>1.6931471805599454
</code></pre>
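<p>Note that <code>vec.idf_</code> holds only the idf part; the actual tf-idf weight of a word varies per document. A minimal sketch (same three documents as above) that gets per-document tf-idf weights via <code>fit_transform</code>:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['this is me', 'this was not that you thought', 'lets test them']
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # fit and transform in one step

# tf-idf weight of 'thought' in the second document
# (rows are L2-normalized by default)
col = vec.vocabulary_['thought']
print(tfidf[1, col])  # ≈ 0.423
```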
<p>Reference:
1. <a href="https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/" rel="nofollow noreferrer">prepare-text</a></p>
<p>Now the same thing with textacy:</p>
<pre><code>import spacy
nlp = spacy.load('en') ## install it by python -m spacy download en (run as administrator)
doc_strings = [
'this is me','this was not that you thought', 'lets test them'
]
docs = [nlp(string.lower()) for string in doc_strings]
corpus = textacy.Corpus(nlp,docs =docs)
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform((doc.to_terms_list(ngrams=1, normalize='lower',as_strings=True,filter_stops=False) for doc in corpus))
print(vectorizer.terms_list)
print(doc_term_matrix.toarray())
</code></pre>
<p>Output:</p>
<pre><code>['is', 'lets', 'me', 'not', 'test', 'that', 'them', 'this', 'thought','was', 'you']
[[1.69314718 0. 1.69314718 0. 0. 0.
0. 1.28768207 0. 0. 0. ]
[0. 0. 0. 1.69314718 0. 1.69314718
0. 1.28768207 1.69314718 1.69314718 1.69314718]
[0. 1.69314718 0. 0. 1.69314718 0.
1.69314718 0. 0. 0. 0. ]]
</code></pre>
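<p>To read a single weight out of that matrix, look the term up in <code>terms_list</code> to get its column. A sketch working directly on the printed output (plain NumPy, no textacy required):</p>

```python
import numpy as np

# the terms_list and doc-term matrix printed above
terms = ['is', 'lets', 'me', 'not', 'test', 'that', 'them', 'this',
         'thought', 'was', 'you']
m = np.array([
    [1.69314718, 0., 1.69314718, 0., 0., 0., 0., 1.28768207, 0., 0., 0.],
    [0., 0., 0., 1.69314718, 0., 1.69314718, 0., 1.28768207, 1.69314718,
     1.69314718, 1.69314718],
    [0., 1.69314718, 0., 0., 1.69314718, 0., 1.69314718, 0., 0., 0., 0.],
])

col = terms.index('thought')
print(m[:, col])  # non-zero only for the second document, which contains 'thought'
```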
<p>Reference: <a href="https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#make-a-corpus" rel="nofollow noreferrer">link</a></p>