How do I get the TF-IDF scores of the most important words?

Posted 2024-09-30 16:32:15


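I am working on a project that uses TF-IDF. My DataFrame has a column (df['liststring']) containing preprocessed text from various documents (no punctuation, no stop words, and so on).

I ran the code below and got the 10 words with the highest TF-IDF values for each document, but I would also like to see their scores.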

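    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(df['liststring']).toarray()
    vocab = tfidf.vocabulary_
    # Map column index -> word (vocabulary_ maps word -> column index)
    reverse_vocab = {v: k for k, v in vocab.items()}
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = tfidf.get_feature_names_out()
    df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)
    # Column indices sorted by ascending score; the last 10 are the top 10
    idx = X_tfidf.argsort(axis=1)
    tfidf_max10 = idx[:, -10:]
    df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10]

    df_tfidf['top10']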

0      [kind, pose, world, preventive, sufficient, ke...
1      [mode, california, diseases, evidence, zoonoti...
2      [researcher, commentary, allegranzi, say, mora...
3      [carry, mild, man, whatever, suffering, downpl...
4      [region, service, almost, wednesday, detect, f...
                             ...                        
754    [americans, plan, year, black, online, shop, s...
755    [relate, manor, tuesday, death, portobello, ce...
756    [one, october, eight, exist, transmit, cluster...
757    [wolfe, shelter, county, resident, cupertino, ...
758    [firework, year, blasio, day, marching, reimag...

Taking the first row as an example, instead of [kind, pose, world, preventive, sufficient, ke…], I would like the output to look like [kind: 0.2, pose: 0.3, world: 0.4, preventive: 0.5, sufficient: 0.6, ke…].


1 answer
User
#1 · posted 2024-09-30 16:32:15
df_tfidf['top10'] = [[(reverse_vocab.get(item), X_tfidf[i, item])  for item in row] 
                     for i, row in enumerate(tfidf_max10) ]

Test case:

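import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(
    {'liststring': ['this is a cat', 'that is a dog', "a apple on the tree"]}
)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['liststring']).toarray()
vocab = tfidf.vocabulary_
# Map column index -> word
reverse_vocab = {v: k for k, v in vocab.items()}
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)
idx = X_tfidf.argsort(axis=1)
# Keep the two highest-scoring columns of each row
tfidf_max2 = idx[:, -2:]
# Pair each word with its score in row i
print([[(reverse_vocab.get(item), X_tfidf[i, item]) for item in row]
       for i, row in enumerate(tfidf_max2)])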

Output:

[[('cat', 0.6227660078332259), ('this', 0.6227660078332259)],
 [('dog', 0.6227660078332259), ('that', 0.6227660078332259)], 
 [('the', 0.5), ('tree', 0.5)]]
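
As a sanity check on these numbers, the score 0.6227… for 'cat' and 'this' can be reproduced by hand. This is a sketch that assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`, and a token pattern that drops single-character words such as 'a'); `smooth_idf` below is a helper name introduced here, not a scikit-learn function.

```python
import math

n_docs = 3  # number of documents in the test corpus


def smooth_idf(doc_freq, n=n_docs):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + doc_freq)) + 1


# Document 'this is a cat': 'a' is dropped, every other word occurs once,
# so the term frequency is 1 and the unnormalized weight equals the idf.
weights = {
    'this': smooth_idf(1),  # appears in 1 document
    'is': smooth_idf(2),    # appears in 2 documents
    'cat': smooth_idf(1),   # appears in 1 document
}

# L2-normalize the row so that its squared weights sum to 1.
norm = math.sqrt(sum(w * w for w in weights.values()))
scores = {word: w / norm for word, w in weights.items()}
print(scores)
```

The same reasoning explains the 0.5 values in the third row: its four remaining words each occur once and in one document only, so they share the same weight and L2 normalization gives 1/sqrt(4) = 0.5 for each.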
