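I have been using some code that I found online for multi-class text classification with scikit-learn: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f. I am running it on my own dataset, which consists of tweets related to hate speech, and I am trying to find the terms most correlated with each of my labels. The labels are Hate and Non-Hate. The most correlated unigrams and bigrams I get for the two labels are exactly the same. What is going wrong? I also tried the dataset provided in the article, and there the code works as expected.
My results look like this: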
# 'Non-Hate':
. Most correlated unigrams:
. idiot
. stupid
. Most correlated bigrams:
. fucking idiot
. fucking bitch
# 'Non-Hate':
. Most correlated unigrams:
. idiot
. stupid
. Most correlated bigrams:
. fucking idiot
. fucking bitch
The code I used is:
df['category_id'] = df['Code'].factorize()[0]
category_id_df = df[['Code', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Code']].values)
df.head()
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=3, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Tweet).toarray()
labels = df.category_id
features.shape
from sklearn.feature_selection import chi2
import numpy as np
N = 2
for Code, category_id in sorted(category_to_id.items()):
    # One-vs-rest target: True for rows of this class, False for all others
    features_chi2 = chi2(features, labels == category_id)
    # Sort feature indices by ascending chi2 score
    indices = np.argsort(features_chi2[0])
    # On scikit-learn >= 1.0 use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Code))
    print(" . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print(" . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
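With only two classes, the results should be the same. The chi-squared test finds the features that differ most (in some sense) between the two classes. Your reference gives different lists per class because it has more than two classes, and the target used (labels == category_id) is a one-vs-rest distinction. With two classes, "this class vs. the rest" is the same split for both labels, only with the roles swapped. Note also that a unigram or bigram that strongly indicates *not* being in a certain category will still have a high chi2 value for that category.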