Why does the TF-IDF calculation take so much time?

# Calculating the Term Frequency, Inverse Document Frequency score
import os
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return tb(blob).words.count(word) / len(tb(blob).words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in tb(blob).words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

# Stemming the articles
from nltk.stem import PorterStemmer
port = PorterStemmer()
bloblist = []
doclist = [pdf1, pdf2, pdf3]  # Defined earlier, not showing here as it is not relevant to the question
for doc in doclist:
    bloblist.append(port.stem(str(doc)))

# TF-IDF calculation on the stemmed articles
for index, blob in enumerate(bloblist):
    print("Top words in document {}".format(index + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in tb(blob).words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    i=1
    for word, score in sorted_words[:5]:
        print("\tWord "+str(i)+": {}, TF-IDF: {}".format(word, round(score, 5)))
        i+=1

2 answers

User

#1 · edited 2024-09-24 02:23:07

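As the other answer mentions, you are calling tb(blob) far too often. For a document with N words, it is called several times for every word, and each call tokenizes the whole text again, so the total work grows roughly as N^2. That will always be slow. You need to make a change like this: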

for index, blob in enumerate(bloblist):
    print("Top words in document {}".format(index + 1))
    # XXX use textblob here just once
    tblob = tb(blob)
    scores = {word: tfidf(word, tblob, bloblist) for word in tblob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    i=1
    for word, score in sorted_words[:5]:
        print("\tWord "+str(i)+": {}, TF-IDF: {}".format(word, round(score, 5)))
        i+=1

You will also need to change the tfidf helper functions so that they work with tblob directly instead of calling tb(blob) every time.
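
One way to make that change, as a minimal sketch rather than a definitive implementation: tokenize every document exactly once and have the helpers work on the resulting word lists. The `tokenize` function below is a stand-in regex tokenizer that also lowercases (an assumption, so the sketch runs without extra packages); with TextBlob you would use `tb(text).words` in its place. The name `top_words` and the changed helper signatures are hypothetical and not part of the original code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption); with TextBlob this would be tb(text).words.
    return re.findall(r"[a-z0-9']+", text.lower())


def tf(word, counts, total):
    # Term frequency from a precomputed Counter, no re-tokenizing.
    return counts[word] / total


def idf(word, doc_sets):
    # Same formula as the question, but membership tests run on prebuilt sets.
    n_containing = sum(1 for words in doc_sets if word in words)
    return math.log(len(doc_sets) / (1 + n_containing))


def top_words(texts, k=5):
    # Hypothetical helper: returns the k best (word, score) pairs per document.
    token_lists = [tokenize(text) for text in texts]  # each document tokenized once
    doc_sets = [set(tokens) for tokens in token_lists]
    results = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {w: tf(w, counts, total) * idf(w, doc_sets) for w in counts}
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        results.append(ranked[:k])
    return results


texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]
for index, ranked in enumerate(top_words(texts)):
    print("Top words in document {}".format(index + 1))
    for i, (word, score) in enumerate(ranked, start=1):
        print("\tWord {}: {}, TF-IDF: {}".format(i, word, round(score, 5)))
```

Scoring each distinct word once (iterating over `counts` rather than over every token) also avoids recomputing the same score for repeated words.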

User

#2 · edited 2024-09-24 02:23:07

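At a cursory glance, a few things jump out:

1) Without knowing how tb is implemented, it looks like you are calling tb(blob) for every word. Building that object once and reusing what it returns should speed things up.
2) nltk has its own TF-IDF implementation, which is likely to be better optimized and could speed things up.
3) You could implement this in numpy instead of plain Python, which would definitely speed things up.

Even then, it is better to cache results and reuse them than to call a potentially heavy function many times.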
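
To illustrate the caching advice, here is a rough sketch that computes every document frequency in a single pass, so the IDF of a word becomes a dictionary lookup instead of a scan over all documents. It assumes the documents are already tokenized into word lists; `document_frequencies` and `idf_table` are hypothetical names, and the formula is the one from the question.

```python
import math
from collections import Counter


def document_frequencies(token_lists):
    # Count, for every word, how many documents contain it.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # each word counted at most once per document
    return df


def idf_table(token_lists):
    # Precompute log(N / (1 + df)) for every word seen in the corpus.
    n_docs = len(token_lists)
    df = document_frequencies(token_lists)
    return {word: math.log(n_docs / (1 + count)) for word, count in df.items()}


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew over the house".split(),
]
idf = idf_table(docs)
print(idf["cat"], idf["the"])
```

With the table built once up front, scoring a word needs only its count in the current document and one lookup in `idf`.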
