<h2>Fundamentals</h2>
<p>Before getting into the actual question, let's first clarify the definitions.</p>
<p>Suppose our corpus contains 3 documents (d1, d2 and d3 respectively):</p>
<pre><code>corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
</code></pre>
<h2>Term frequency (tf)</h2>
<p>The tf of a word is defined as the number of times the word appears in a document.</p>
<pre><code>tf(word, document) = count(word, document)
</code></pre>
<p>tf is defined for a word at the document level.</p>
<pre><code>tf('a',d1) = 1 tf('a',d2) = 1 tf('a',d3) = 1
tf('apple',d1) = 1 tf('apple',d2) = 1 tf('apple',d3) = 0
tf('cat',d1) = 0 tf('cat',d2) = 0 tf('cat',d3) = 1
tf('green',d1) = 0 tf('green',d2) = 1 tf('green',d3) = 0
tf('is',d1) = 1 tf('is',d2) = 1 tf('is',d3) = 1
tf('red',d1) = 1 tf('red',d2) = 0 tf('red',d3) = 0
tf('this',d1) = 1 tf('this',d2) = 1 tf('this',d3) = 1
</code></pre>
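<p>The table above can be reproduced with a few lines of plain Python (no textacy needed), assuming simple whitespace tokenization; <code>Counter</code> gives the raw counts per document:</p>

```python
from collections import Counter

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]

# One Counter per document: raw term frequencies
tf = [Counter(doc.split()) for doc in corpus]

print(tf[0]["apple"])  # 1 -> tf('apple', d1)
print(tf[2]["cat"])    # 1 -> tf('cat', d3)
print(tf[2]["apple"])  # 0 -> a Counter returns 0 for missing words
```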
<p>There is a problem with using raw counts: <code>tf</code> values of words in longer documents are higher than in shorter ones. This can be solved by normalizing the raw counts: dividing each count by the document length (the number of words in the corresponding document). This is called <code>l1</code> normalization. Document <code>d1</code> can now be represented by a <code>tf vector</code> containing the <code>tf</code> values of all the words in the corpus. There is also <code>l2</code> normalization, which makes the <code>l2</code> norm of a document's tf vector equal to 1.</p>
<pre><code>tf(word, document, normalize='l1') = count(word, document)/|document|
tf(word, document, normalize='l2') = count(word, document)/l2_norm(document)
</code></pre>
<pre><code>|d1| = 5, |d2| = 5, |d3| = 4
l2_norm(d1) = 2.236, l2_norm(d2) = 2.236, l2_norm(d3) = 2.0
</code></pre>
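<p>As a sketch of the two normalizations in plain Python (same whitespace-tokenization assumption as above):</p>

```python
import math
from collections import Counter

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]

def tf(document, norm=None):
    tokens = document.split()
    counts = Counter(tokens)
    if norm == 'l1':  # divide by document length
        return {w: c / len(tokens) for w, c in counts.items()}
    if norm == 'l2':  # divide by the l2 norm of the count vector
        l2 = math.sqrt(sum(c * c for c in counts.values()))
        return {w: c / l2 for w, c in counts.items()}
    return dict(counts)

print(tf(corpus[0], 'l1')['apple'])            # 0.2
print(round(tf(corpus[0], 'l2')['apple'], 7))  # 0.4472136
print(tf(corpus[2], 'l1')['cat'])              # 0.25
```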
<p><strong>Code: tf</strong></p>
<pre><code>corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
# Convert docs to textacy format
textacy_docs = [textacy.Doc(doc) for doc in corpus]
for norm in [None, 'l1', 'l2']:
# tokenize the documents
tokenized_docs = [
doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
for doc in textacy_docs]
# Fit the tf matrix
vectorizer = textacy.Vectorizer(apply_idf=False, norm=norm)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("TF with {0} normalize".format(norm))
print (doc_term_matrix.toarray())
</code></pre>
<p>Output:</p>
<pre><code>Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with None normalize
[[1 1 0 0 1 1 1]
[1 1 0 1 1 0 1]
[1 0 1 0 1 0 1]]
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l1 normalize
[[0.2 0.2 0. 0. 0.2 0.2 0.2 ]
[0.2 0.2 0. 0.2 0.2 0. 0.2 ]
[0.25 0. 0.25 0. 0.25 0. 0.25]]
Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l2 normalize
[[0.4472136 0.4472136 0. 0. 0.4472136 0.4472136 0.4472136]
[0.4472136 0.4472136 0. 0.4472136 0.4472136 0. 0.4472136]
[0.5 0. 0.5 0. 0.5 0. 0.5 ]]
</code></pre>
<p>Rows in the <code>tf</code> matrix correspond to documents (so our corpus gives 3 rows) and columns correspond to the words in the vocabulary (at the indices shown in the vocabulary dictionary).</p>
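<p>To read a single value out of the matrix, index the row by document and the column by the word's index from <code>vocabulary_terms</code>. A small illustration, with the unnormalized matrix from the output above pasted in as a plain list of lists:</p>

```python
# Vocabulary indices as printed by vectorizer.vocabulary_terms
vocab = {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}

# The "TF with None normalize" matrix from the output above
tf_matrix = [
    [1, 1, 0, 0, 1, 1, 1],  # d1
    [1, 1, 0, 1, 1, 0, 1],  # d2
    [1, 0, 1, 0, 1, 0, 1],  # d3
]

print(tf_matrix[2][vocab['cat']])    # 1 -> tf('cat', d3)
print(tf_matrix[0][vocab['green']])  # 0 -> tf('green', d1)
```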
<h2>Inverse document frequency (idf)</h2>
<p>Some words convey less information than others. For example, words like a, an, this are very common and convey very little information. idf is a measure of a word's importance: a word that appears in many documents is considered less informative than a word that appears in only a few.</p>
<pre><code>idf(word, corpus) = log(|corpus| / No:of documents containing word) + 1 # standard idf
</code></pre>
<p>For our corpus, <code>idf('apple', corpus) &lt; idf('cat', corpus)</code>:</p>
<pre><code>idf('apple', corpus) = log(3/2) + 1 = 1.405
idf('cat', corpus) = log(3/1) + 1 = 2.098
idf('this', corpus) = log(3/3) + 1 = 1.0
</code></pre>
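<p>The same numbers fall out of a plain-Python sketch of the standard idf formula (natural logarithm; the 2.098 above is 2.0986 rounded down):</p>

```python
import math

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
doc_words = [set(doc.split()) for doc in corpus]

def idf(word):
    # standard idf: log(N / df) + 1, natural logarithm
    df = sum(1 for words in doc_words if word in words)
    return math.log(len(doc_words) / df) + 1

print(round(idf('apple'), 3))  # 1.405
print(round(idf('cat'), 3))    # 2.099
print(idf('this'))             # 1.0
```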
<p><strong>Code: idf</strong></p>
<pre><code>textacy_docs = [textacy.Doc(doc) for doc in corpus]
tokenized_docs = [
doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
for doc in textacy_docs]
vectorizer = textacy.Vectorizer(apply_idf=False, norm=None)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("standard idf: ")
print (textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, type_='standard'))
</code></pre>
<p>Output:</p>
<pre><code>Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
standard idf:
[1. 1.405 2.098 2.098 1. 2.098 1.]
</code></pre>
<p><strong>Term frequency-inverse document frequency (tf-idf)</strong>: tf-idf is a measure of how important a word is to a document in a corpus. The tf of a word weighted by its idf gives the tf-idf measure of the word.</p>
<pre><code>tf-idf(word, document, corpus) = tf(word, document) * idf(word, corpus)
</code></pre>
<pre><code>tf-idf('apple', 'd1', corpus) = tf('apple', 'd1') * idf('apple', corpus) = 1 * 1.405 = 1.405
tf-idf('cat', 'd3', corpus) = tf('cat', 'd3') * idf('cat', corpus) = 1 * 2.098 = 2.098
</code></pre>
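<p>Putting tf and idf together in plain Python (same assumptions as the earlier sketches: whitespace tokenization, natural-log standard idf):</p>

```python
import math

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
tokenized = [doc.split() for doc in corpus]

def tf_idf(word, doc_index):
    tf = tokenized[doc_index].count(word)
    # document frequency: in how many documents the word appears
    df = sum(1 for tokens in tokenized if word in tokens)
    idf = math.log(len(tokenized) / df) + 1
    return tf * idf

print(round(tf_idf('apple', 0), 3))  # 1.405
print(round(tf_idf('cat', 2), 3))    # 2.099
```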
<p><strong>Code: tf-idf</strong></p>
<pre><code>textacy_docs = [textacy.Doc(doc) for doc in corpus]
tokenized_docs = [
doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
for doc in textacy_docs]
print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("tf-idf: ")
vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print (doc_term_matrix.toarray())
</code></pre>
<p>Output:</p>
<pre><code>Vocabulary: {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
tf-idf:
[[1. 1.405 0. 0. 1. 2.098 1. ]
[1. 1.405 0. 2.098 1. 0. 1. ]
[1. 0. 2.098 0. 1. 0. 1. ]]
</code></pre>
<h2>Now to answer the questions:</h2>
<blockquote>
<p>(1) How can I get the TF-IDF for this word against the corpus, rather
than each character?</p>
</blockquote>
<p>As described above, <code>tf-idf</code> is not defined in isolation; the <code>tf-idf</code> of a word is defined with respect to a document in the corpus.</p>
<blockquote>
<p>(2) How can I provide my own corpus and point to it as a param?</p>
</blockquote>
<p>As shown in the samples above:</p>
<ol>
<li>Convert the text documents to textacy Docs using the textacy.Doc API.</li>
<li>Tokenize the textacy Docs using the to_terms_list method. (With this method you can add unigrams, bigrams or trigrams to the vocabulary, filter out stop words, normalize text, etc.)</li>
<li>Use textacy.Vectorizer to create a term matrix from the tokenized documents. The returned matrix is
<ul>
<ul>
<li><code>tf (raw counts): apply_idf=False, norm=None</code></li>
<li><code>tf (l1 normalized): apply_idf=False, norm='l1'</code></li>
<li><code>tf (l2 normalized): apply_idf=False, norm='l2'</code></li>
<li><code>tf-idf (standard): apply_idf=True, idf_type='standard'</code></li>
</ul></li>
</ol>
<blockquote>
<p>(3) Can TF-IDF be used at a sentence level? ie: what is the relative
frequency of this sentence's terms against the corpus.</p>
</blockquote>
<p>Yes you can, if and only if you treat each sentence as a separate document. In that case, the <code>tf-idf</code> vector of the corresponding document (the whole row) can be treated as a vector representation of that document (a single sentence in your case).</p>
<p>For our corpus (where each document in fact consists of a single sentence), the vector representations of d1 and d2 should be closer to each other than those of d1 and d3. Let's check with cosine similarity:</p>
<pre><code>from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(doc_term_matrix)
</code></pre>
<p>Output:</p>
<pre><code>array([[1. , 0.53044716, 0.35999211],
[0.53044716, 1. , 0.35999211],
[0.35999211, 0.35999211, 1. ]])
</code></pre>
<p>As you can see, cosine_similarity(d1, d2) = 0.53 and cosine_similarity(d1, d3) = 0.35, so d1 and d2 are indeed more similar than d1 and d3 (1 means fully similar, 0 means dissimilar: orthogonal vectors).</p>
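<p>Those similarity numbers can be checked end-to-end without textacy or scikit-learn; a plain-Python sketch of tf-idf vectors plus cosine similarity (same assumptions as the earlier sketches):</p>

```python
import math

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
tokenized = [doc.split() for doc in corpus]
vocab = sorted(set(w for tokens in tokenized for w in tokens))

def idf(word):
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / df) + 1

# One tf-idf vector per document (each document here is one sentence)
vectors = [[tokens.count(w) * idf(w) for w in vocab] for tokens in tokenized]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine(vectors[0], vectors[1]), 4))  # 0.5304 -> d1 vs d2
print(round(cosine(vectors[0], vectors[2]), 4))  # 0.36   -> d1 vs d3
```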
<p>Once you have trained your <code>Vectorizer</code>, you can save the trained object to disk for later use.</p>
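<p>For persisting, pickle is one option; a minimal sketch, with a plain dict standing in for the fitted Vectorizer (any picklable object works the same way):</p>

```python
import os
import pickle
import tempfile

# Stand-in for a fitted vectorizer; substitute your real trained object
vectorizer = {'vocabulary_terms': {'this': 6, 'is': 4, 'a': 0}}

path = os.path.join(tempfile.mkdtemp(), 'vectorizer.pkl')
with open(path, 'wb') as f:
    pickle.dump(vectorizer, f)   # save to disk

with open(path, 'rb') as f:
    restored = pickle.load(f)    # load it back later

print(restored == vectorizer)  # True
```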
<h2>Conclusion</h2>
<p>The <code>tf</code> of a word is at the document level, the <code>idf</code> of a word is at the corpus level, and the <code>tf-idf</code> of a word is at the document level with respect to the corpus. They are well suited to vector representations of documents (or of sentences, when each document consists of a single sentence). If you are interested in vector representations of words, explore word embeddings such as word2vec, fasttext, glove, etc.</p>