Using TF-IDF to find relevant words across a set of documents

Posted 2024-09-24 02:14:46


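I have 2 books in txt format (6,000+ lines). I want to use Python to compute the relevance of each word (using the TF-IDF algorithm) and sort the words in descending order of relevance. I tried this code: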

# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""FULL BOOK1 TEST""")
document2 = tb("""FULL BOOK2 TEST""")



bloblist = [document1, document2]
# Open the output file once; reopening it with 'w' inside the loop
# would overwrite the results of the previous document.
with open("result.txt", 'w') as textfile:
    for i, blob in enumerate(bloblist):
        print("Top words in document {}".format(i + 1))
        # Score every distinct word in this document.
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        # Highest TF-IDF first.
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words:
            textfile.write("Word: {}, TF-IDF: {}".format(word, round(score, 5)) + "\n")

I found it at https://stevenloria.com/tf-idf/ and made some changes, but it takes a very long time to run, and after a few minutes it crashes with TypeError: coercing to Unicode: need string or buffer, float found. Why?
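A likely contributor to the slowness: `tf` rescans the whole word list for every token, and the dict comprehension visits every token (duplicates included), so the work grows roughly quadratically with the length of each book. The sketch below is one way to avoid that. It counts each word once with `collections.Counter` and reuses the same `tf * log(N / (1 + n))` formula as the code above. It is written for Python 3 and does not use TextBlob; `tokenize`, `tfidf_all`, and `ranked` are hypothetical names, and the regex tokenizer is a simplifying assumption that will split text differently from TextBlob.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def tfidf_all(texts):
    """Return one {word: score} dict per text, using tf * log(N / (1 + n))."""
    token_lists = [tokenize(t) for t in texts]
    counts = [Counter(tokens) for tokens in token_lists]

    # Document frequency: in how many texts each word appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(texts)
    results = []
    for tokens, c in zip(token_lists, counts):
        total = len(tokens)
        results.append({
            word: (n / total) * math.log(n_docs / (1 + df[word]))
            for word, n in c.items()
        })
    return results


def ranked(scores):
    # Highest score first, i.e. descending order of relevance.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Each text is tokenized and counted once, so the cost is roughly linear in the total number of words. Calling `ranked(tfidf_all([book1_text, book2_text])[0])` would give the scored words of the first book in descending order, where `book1_text` and `book2_text` stand for the file contents read as strings.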

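I also tried calling this Java program from Python: https://github.com/mccurdyc/tf-idf/. With it, a word that should be highly relevant is scored 0 instead of receiving a high relevance score.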
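For the Python `idf` defined above, zero scores are expected when there are only two documents: a word found in one book gives log(2 / (1 + 1)) = log 1 = 0, and a word found in both gives log(2 / 3), which is negative. The question does not say which formula the Java program uses, so this may or may not explain its output. A small check of the arithmetic follows; `idf_smoothed` and `idf_plain` are illustrative names, and the second is the common unsmoothed variant log(N / n), not the formula in the code above.

```python
import math


def idf_smoothed(n_docs, n_containing):
    # Formula from the question's idf(): log(N / (1 + n)).
    return math.log(n_docs / (1 + n_containing))


def idf_plain(n_docs, n_containing):
    # Unsmoothed variant, shown for comparison only: log(N / n).
    return math.log(n_docs / n_containing)


# Two books (N = 2):
one_book_smoothed = idf_smoothed(2, 1)    # log(1) = 0.0, so tf-idf is 0
both_books_smoothed = idf_smoothed(2, 2)  # log(2/3), negative
one_book_plain = idf_plain(2, 1)          # log(2), positive
both_books_plain = idf_plain(2, 2)        # log(1) = 0.0
```

With more documents in the collection (for example, if each book were split into chapters), the smoothed formula starts to give positive scores to words concentrated in only a few of them.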

Is there a way to fix the Python code? Or can you suggest another TF-IDF implementation that correctly does what I want?


0 answers

No answers yet
