TF-IDF DataFrame in Python

Posted 2024-09-28 03:11:41


I need to classify some sentiments. My DataFrame looks like this:

Phrase                      Sentiment

is it  good movie          positive

wooow is it very goode      positive

bad movie                  negative

I did some preprocessing (tokenization, stop-word removal, stemming, etc.) and got:

^{pr2}$

In the end I need a DataFrame whose rows are the texts, whose columns are the words, and whose values are the tf-idf scores, like this:

good     movie   wooow    very      bad                Sentiment

tf idf    tfidf_  tfidf    tf_idf    tf_idf               positive

(the remaining two rows follow the same pattern)


2 Answers

I would use sklearn.feature_extraction.text.TfidfVectorizer, which is designed specifically for this kind of task:

Demo:

In [63]: df
Out[63]:
                   Phrase Sentiment
0       is it  good movie  positive
1  wooow is it very goode  positive
2               bad movie  negative

Solution:

^{pr2}$

Result:

In [31]: r.join(df)
Out[31]:
  Sentiment  bad  good     goode     wooow
0  positive  0.0   1.0  0.000000  0.000000
1  positive  0.0   0.0  0.707107  0.707107
2  negative  1.0   0.0  0.000000  0.000000

Update: memory-saving solution:

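from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')

# pop() removes the 'Phrase' column from df while handing it to the vectorizer
X = vect.fit_transform(df.pop('Phrase')).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = X[:, i]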

Update 2: related question where the memory issue was finally solved

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [['good', 'movie'], 'positive'],
    [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
    [['bad', 'movie'], 'negative']
], columns=['Phrase', 'Sentiment'])

df

                        Phrase Sentiment
0                [good, movie]  positive
1  [wooow, is, it, very, good]  positive
2                 [bad, movie]  negative

Compute term frequency (tf)

^{pr2}$
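
One way to build the term-frequency table, sketched under the assumption that tf holds raw per-row term counts. That assumption is consistent with the tfidf table at the end of this answer, where a single occurrence of good in row 0 yields exactly its idf value. Using collections.Counter is an illustrative choice, not necessarily the original code.

```python
from collections import Counter

import pandas as pd

# Setup from this answer: each phrase is already a list of tokens
df = pd.DataFrame([
    [['good', 'movie'], 'positive'],
    [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
    [['bad', 'movie'], 'negative']
], columns=['Phrase', 'Sentiment'])

# Count tokens per row; terms absent from a row become 0
tf = (
    pd.DataFrame([dict(Counter(tokens)) for tokens in df['Phrase']],
                 index=df.index)
    .fillna(0)
    .sort_index(axis=1)  # alphabetical column order
)
print(tf)
```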

Compute inverse document frequency (idf)

# add one to numerator and denominator just in case a term isn't in any document
# the value is zero for a term in every document and at most log(N + 1) for a term in none
idf = np.log((len(df) + 1) / (tf.gt(0).sum() + 1))
idf

bad      0.693147
good     0.287682
is       0.693147
it       0.693147
movie    0.287682
very     0.693147
wooow    0.693147
dtype: float64
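
The values above can be checked by hand. With N = 3 documents, a term found in one document gets log(4 / 2) and a term found in two documents gets log(4 / 3). A small sketch using only the standard library; the document frequencies are read off the setup data, and the names n_docs and doc_freq are illustrative:

```python
import math

n_docs = 3  # number of rows in df

# Number of documents each term appears in, taken from the setup data
doc_freq = {"bad": 1, "good": 2, "is": 1, "it": 1,
            "movie": 2, "very": 1, "wooow": 1}

# Same smoothed formula as above: log((N + 1) / (df_t + 1))
idf = {term: math.log((n_docs + 1) / (count + 1))
       for term, count in doc_freq.items()}

for term, value in idf.items():
    print(f"{term:<6} {value:.6f}")
```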

tfidf

tf * idf

        bad      good        is        it     movie      very     wooow
0  0.000000  0.287682  0.000000  0.000000  0.287682  0.000000  0.000000
1  0.000000  0.287682  0.693147  0.693147  0.000000  0.693147  0.693147
2  0.693147  0.000000  0.000000  0.000000  0.287682  0.000000  0.000000
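
The question also asks for the Sentiment column next to the scores. A minimal end-to-end sketch that attaches it, assuming raw term counts for tf as in the table above; the names tfidf and result are illustrative:

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [['good', 'movie'], 'positive'],
    [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
    [['bad', 'movie'], 'negative']
], columns=['Phrase', 'Sentiment'])

# Raw term counts per row (assumed definition of tf)
tf = (
    pd.DataFrame([dict(Counter(tokens)) for tokens in df['Phrase']],
                 index=df.index)
    .fillna(0)
    .sort_index(axis=1)
)

# Smoothed idf, same formula as above
idf = np.log((len(df) + 1) / (tf.gt(0).sum() + 1))

# The Series idf aligns with the columns of tf, so this scales each term column
tfidf = tf * idf

# Attach the label column by row index
result = tfidf.join(df['Sentiment'])
print(result)
```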
