<p>In the end, I solved it with the <a href="http://en.wikipedia.org/wiki/Tf-idf" rel="nofollow noreferrer">Tf-idf</a> algorithm, inspired by <a href="https://stackoverflow.com/a/8897648/957253">@larsmans' answer</a>:</p>
<p>Tf-idf (and similar text transformations) is implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is straightforward:</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since the vectorizer returns L2-normalized tf-idf rows
pairwise_similarity = tfidf * tfidf.T
</code></pre>
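<p>Alternatively, if the documents are already plain strings, pass the list of strings directly to <code>fit_transform</code> instead of reading them from files.</p>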
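<p>As a minimal sketch of the plain-string case, assuming a small made-up corpus (the four sentences and the variable names below are illustrative, not taken from the original question), the whole computation might look like this:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
]

# Rows of the tf-idf matrix are L2-normalized by default (norm="l2"),
# so the dot product of two rows is their cosine similarity.
tfidf = TfidfVectorizer().fit_transform(corpus)

# `@` is the matrix product; the result is a documents-by-documents matrix.
pairwise_similarity = (tfidf @ tfidf.T).toarray()

# Entry [i, j] is the similarity between document i and document j.
print(pairwise_similarity.round(3))
```

<p>The diagonal is 1 because every document is identical to itself, and documents that share no tokens get a similarity of 0.</p>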
<p><strong>A few useful links:</strong></p>
<ul>
<li><a href="https://code.google.com/p/tfidf/" rel="nofollow noreferrer">https://code.google.com/p/tfidf/</a></li>
<li><a href="https://github.com/hrs/python-tf-idf" rel="nofollow noreferrer">https://github.com/hrs/python-tf-idf</a></li>
<li><a href="https://github.com/reddavis/TF-IDF" rel="nofollow noreferrer">https://github.com/reddavis/TF-IDF</a></li>
<li><a href="https://github.com/opennorth/tf-idf-similarity" rel="nofollow noreferrer">https://github.com/opennorth/tf-idf-similarity</a></li>
</ul>
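<p>To make the computation behind these libraries more concrete, here is a rough pure-Python sketch of tf-idf weighting followed by cosine similarity. It assumes whitespace tokenization and a simplified weight of term count times <code>log(N / df) + 1</code>, which is not the exact formula scikit-learn uses by default. The helper names <code>tfidf_vectors</code> and <code>cosine</code> are hypothetical.</p>

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Simplified weighting (an assumption): count * (log(N / df) + 1).
        weights = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = ["the cat sat", "the cat ran", "dogs bark loudly"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares "the" and "cat"
print(cosine(vecs[0], vecs[2]))  # shares no terms
```

<p>Because each vector is normalized to unit length, the cosine similarity reduces to a plain dot product, which is the same reason the scikit-learn version needs no separate normalization step.</p>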