Combining Python/Pandas aggregation with NLTK

2015-06-02 14:50:54    Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53    RT @ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53    Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53    Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53    Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570

1 answer

User

#1 · Posted 2024-10-01 13:27:00

Based on EdChum's comment, here is one way to get the (I assume global) word counts from CountVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

df = pd.DataFrame({
    'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
             'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
    'class': ['a', 'a', 'a', 'a', 'c', 'c', 'b', 'e'],
})

# Learn the vocabulary and build the sparse document-term matrix
X = vect.fit_transform(df['text'].values)
y = df['class'].values

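Convert the sparse matrix returned by CountVectorizer to a dense matrix and pass it, together with the feature names, to the DataFrame constructor. Then transpose the frame and sum along axis=1 to get the total count for each word: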

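The steps just described can be sketched roughly as follows. This is a minimal, self-contained sketch rather than the answer's original snippet: the names `counts_frame` and `word_counts` are hypothetical, and the feature-name lookup assumes scikit-learn 1.0 or later (`get_feature_names_out`), falling back to the older `get_feature_names` otherwise.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
         'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat']

vect = CountVectorizer()
X = vect.fit_transform(texts)  # sparse document-term matrix

# Feature names: newer scikit-learn exposes get_feature_names_out(),
# older releases only have get_feature_names()
if hasattr(vect, 'get_feature_names_out'):
    names = vect.get_feature_names_out()
else:
    names = vect.get_feature_names()

# Dense matrix + feature names -> one row per document, one column per word
counts_frame = pd.DataFrame(X.toarray(), columns=names)

# Transpose so each row is a word, then sum across documents (axis=1)
word_counts = counts_frame.T.sum(axis=1).sort_values(ascending=False)

print(word_counts.head(3))
```

Note that the default CountVectorizer tokenizer only keeps tokens of two or more characters, so the single-letter word 'a' in 'have a cat' does not appear in the counts. For a large corpus such as the 49570-row series in the question, the dense conversion may use a lot of memory; summing the sparse matrix directly with `X.sum(axis=0)` is one possible alternative.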

If you only care about the frequency distribution of the words, consider using FreqDist from NLTK:

import itertools
import nltk
from nltk.probability import FreqDist

texts = ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
         'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat']
# word_tokenize needs the NLTK Punkt tokenizer data to be downloaded first
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain.from_iterable(texts))

FD = FreqDist(tokens)
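Once the distribution is built, it can be queried like a `collections.Counter`, which FreqDist subclasses. The sketch below shows a few such lookups; to keep it runnable without downloading tokenizer data it assumes plain whitespace splitting with `str.split` instead of `nltk.word_tokenize`, and the name `fd` is hypothetical.

```python
from nltk.probability import FreqDist

texts = ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
         'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat']

# Assumption: whitespace tokenization stands in for nltk.word_tokenize
tokens = [token for text in texts for token in text.split()]

fd = FreqDist(tokens)

print(fd['cat'])          # count of a single word
print(fd.most_common(3))  # the three most frequent (word, count) pairs
print(fd.N())             # total number of tokens counted
```

Unlike the CountVectorizer route, this keeps single-character tokens such as 'a', so the two approaches can give slightly different vocabularies for the same text.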
