<p>You can use TF-IDF over your corpus with scikit-learn's <code>TfidfVectorizer</code>:</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['this is me', 'this was not that you thought', 'lets test them'] ## a list of documents
vec = TfidfVectorizer()
vec.fit(docs) ## fit on your documents
print(vec.vocabulary_) # print the vocabulary; don't run this for 2.5 million documents
</code></pre>
<p>Output: the vocabulary, a dict that maps each word to a unique index.</p>
<pre><code>print(vec.idf_)
</code></pre>
<p>Output: the idf value of each vocabulary word:</p>
<pre><code>[ 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.28768207 1.69314718 1.69314718 1.69314718]
</code></pre>
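<p>These values follow scikit-learn's default smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. A quick sketch reproducing the two distinct values above (assuming the three example documents):</p>

```python
import numpy as np

n = 3        # number of documents
df_once = 1  # every word except 'this' appears in exactly one document
df_this = 2  # 'this' appears in two documents

# sklearn default (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df)) + 1
idf_once = np.log((1 + n) / (1 + df_once)) + 1  # ≈ 1.69314718
idf_this = np.log((1 + n) / (1 + df_this)) + 1  # ≈ 1.28768207

print(idf_once, idf_this)
```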
<p>Now, coming back to your question: to look up the idf of a particular word, use its vocabulary index:</p>
<pre><code>word = 'thought' #example
index = vec.vocabulary_[word]
>8
print(vec.idf_[index]) #prints idf value
>1.6931471805599454
</code></pre>
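<p>Note that <code>vec.idf_</code> holds only the idf part; the actual tf-idf weight of a word varies per document. A minimal sketch (same three documents as above) that gets per-document tf-idf weights via <code>fit_transform</code>:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['this is me', 'this was not that you thought', 'lets test them']
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # fit and transform in one step

# tf-idf weight of 'thought' in the second document
# (rows are L2-normalized by default)
col = vec.vocabulary_['thought']
print(tfidf[1, col])  # ≈ 0.423
```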
<p>Reference:
1. <a href="https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/" rel="nofollow noreferrer">prepare-text</a></p>
<p>Now the same thing with textacy:</p>
<pre><code>import spacy
nlp = spacy.load('en') ## install it by python -m spacy download en (run as administrator)
doc_strings = [
'this is me','this was not that you thought', 'lets test them'
]
docs = [nlp(string.lower()) for string in doc_strings]
corpus = textacy.Corpus(nlp,docs =docs)
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform((doc.to_terms_list(ngrams=1, normalize='lower',as_strings=True,filter_stops=False) for doc in corpus))
print(vectorizer.terms_list)
print(doc_term_matrix.toarray())
</code></pre>
<p>Output:</p>
<pre><code>['is', 'lets', 'me', 'not', 'test', 'that', 'them', 'this', 'thought','was', 'you']
[[1.69314718 0. 1.69314718 0. 0. 0.
0. 1.28768207 0. 0. 0. ]
[0. 0. 0. 1.69314718 0. 1.69314718
0. 1.28768207 1.69314718 1.69314718 1.69314718]
[0. 1.69314718 0. 0. 1.69314718 0.
1.69314718 0. 0. 0. 0. ]]
</code></pre>
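<p>To read a single weight out of that matrix, look the term up in <code>terms_list</code> to get its column. A sketch working directly on the printed output (plain NumPy, no textacy required):</p>

```python
import numpy as np

# the terms_list and doc-term matrix printed above
terms = ['is', 'lets', 'me', 'not', 'test', 'that', 'them', 'this',
         'thought', 'was', 'you']
m = np.array([
    [1.69314718, 0., 1.69314718, 0., 0., 0., 0., 1.28768207, 0., 0., 0.],
    [0., 0., 0., 1.69314718, 0., 1.69314718, 0., 1.28768207, 1.69314718,
     1.69314718, 1.69314718],
    [0., 1.69314718, 0., 0., 1.69314718, 0., 1.69314718, 0., 0., 0., 0.],
])

col = terms.index('thought')
print(m[:, col])  # non-zero only for the second document, which contains 'thought'
```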
<p>Reference: <a href="https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#make-a-corpus" rel="nofollow noreferrer">link</a></p>