Python code to determine the TF-IDF of each tweet from a .txt file

Posted 2024-09-30 00:25:59

I hope you can help me read the lines of a .txt file (treating them as separate documents) and determine the TF-IDF of each tweet.

# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals 
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # compare against the tokenized words; `word in blob` would match substrings of the raw text
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""RT @brides: These are 5 hidden jobs no one one tells about one maids-of-honor one about. You're welcome: jobs http://t.co/qybBewFDre
This brides week on brides twitter: One new brides follower via http://t.co/0NP5Wz70Op""")

document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")

document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

1 answer
User
#1 · Posted 2024-09-30 00:25:59

I'm not sure I've understood you correctly.

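file_names = ['file1.txt', 'file2.txt']
# read each file; `with` closes it as soon as its contents are loaded
documents = []
for file_name in file_names:
    with open(file_name) as f:
        documents.append(f.read())
# create blobs (a real list, so len(bloblist) also works on Python 3,
# where map() returns a one-shot iterator)
bloblist = [tb(document) for document in documents]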
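
Once the documents are loaded, each one can be scored with the same tf and idf formulas as in the question. The following is only a minimal sketch of that step. To stay dependency-free it assumes a simple regex tokenizer instead of TextBlob, so the tokens may differ slightly from `blob.words`. The names `tokenize`, `tfidf_scores` and `top_words` and the sample strings are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for TextBlob's `.words`: lowercase runs of
    # letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_scores(documents):
    """Return one {word: score} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    # For each word, the number of documents that contain it at least once.
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    n_docs = len(token_lists)
    results = []
    for tokens in token_lists:
        scores = {}
        for word, count in Counter(tokens).items():
            tf = count / float(len(tokens))
            # Same smoothing as the question: log(N / (1 + n_containing)).
            idf = math.log(n_docs / (1.0 + doc_freq[word]))
            scores[word] = tf * idf
        results.append(scores)
    return results


def top_words(scores, n=3):
    # Highest-scoring words first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Illustrative input only; in practice these would be the loaded documents.
documents = [
    "python is a snake",
    "python is a language",
    "the colt python is a revolver",
]
for i, scores in enumerate(tfidf_scores(documents)):
    print("Top words in document {}: {}".format(i + 1, top_words(scores)))
```

With this idf definition, a word that occurs in every document gets a slightly negative score, because N / (1 + N) is less than 1. That is expected and simply pushes such words to the bottom of the ranking.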

For more details on reading and writing files, see: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files

You can parse the strings from the file as follows (the code for this step is missing from the source):
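
A minimal sketch of one way to do this, assuming the file holds one tweet per line and is UTF-8 encoded. This is not the answerer's original code; the helper name `read_tweets` and the file name `tweets.txt` are hypothetical.

```python
import io


def read_tweets(path, encoding="utf-8"):
    """Return the non-empty lines of a text file, one string per tweet."""
    # io.open decodes to text and behaves the same on Python 2 and 3;
    # the `with` block closes the file even if reading fails.
    with io.open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Usage (assumes a file named tweets.txt exists and TextBlob is imported as tb):
# tweets = read_tweets("tweets.txt")
# bloblist = [tb(tweet) for tweet in tweets]
```

Each line then becomes its own document, so the resulting `bloblist` can be passed to the `tfidf` function from the question unchanged.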
