我使用Spyder作为Anaconda的一部分,并尝试按事件类型对tweet(文本)进行分类。为此,我使用了包交叉值得分,已经使用TfidVectorizer对我的tweet进行了向量化,然后使用fit\u transform对unigrams、bigrams和trigrams进行了转换,如下所示:
# TF-IDF on unigrams, bigrams and trigrams
tfidf_words = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
ngram_range=(1,1), stop_words='english')
# vectorize for bigrams
tfidf_bigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
ngram_range=(2,2), stop_words='english')
# vecorize for trigrams
tfidf_trigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1',
ngram_range=(3,3), stop_words='english')
# Transform and fit each of the outputs from TF-IDF (unigrams, bigrams and trigrams)
x_train_words = tfidf_words.fit_transform(x_train_sm.preprocessed).toarray()
# bigrams
x_train_bigrams = tfidf_bigrams.fit_transform(x_train_sm.preprocessed).toarray()
#trigrams
x_train_trigrams = tfidf_trigrams.fit_transform(x_train_sm.preprocessed).toarray()
现在,我使用包cross_val_score执行交叉验证,以计算unigram、bigrams和trigrams的平均精度。一旦完成,我将尝试为达到的精度生成并保存一个箱线图。对于4种不同的型号,此功能已完成:
^{pr2}$unigrams的输出正是我想要的:
现在,当我运行bigrams和trigrams的代码时(突出显示所有代码并单击“play”),我得到以下结果:
大图:
[
八卦图:
每一个的代码都是相同的,只是它们使用“cv_bigrams”和“cv_trigrams”作为框式图的数据输入。每个代码如下。在
二元代码:
^{3}$三元代码:
# create blank dataframe with an index equal to number of CV folds * number of models tested
cv_trigrams = pd.DataFrame(index=range(CV * len(models)))
# clear the previous list called 'entries' that was populated with values
entries = []
# calculate the accuracy at each fold and populate the results in the 'entries' list
# populate the dataframe 'cv_trigrams' with the fold and accuracy score at each fold
i = 0
for model in models:
#model_name = #model.__class__.__name__
model_name = names[i]
# model => the model that will be used to fit the data
# x_train_trigrams => data that is to be fitted by the selected model (trigrams)
# y_train_sm => y training data after oversampling (event_id)
# scoring => the type of score you want the function 'cross_val_score' to return
# cv = number of folds you want to performed with cross-validation
accuracies = cross_val_score(model, x_train_trigrams, y_train_sm, scoring ='accuracy', cv=CV)
for fold_idx, accuracy in enumerate(accuracies):
entries.append((model_name, fold_idx, accuracy))
cv_trigrams = pd.DataFrame(entries, columns=['model_name_trigrams', 'fold_idx', 'accuracy'])
i = i + 1
以下是如果我只选择以下代码并运行:
# plot the results of each model as a box plot
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
fig_bigrams = box_bigrams.get_figure()
fig_bigrams.savefig('boxplot_bigrams.png')
同样适用于三角图:
# plot the results of each model as a box plot
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
fig_trigrams = box_trigrams.get_figure()
fig_trigrams.savefig('boxplot_trigrams.png')
输出:
你知道为什么当我一次运行所有代码时(当我把这段代码投入生产时,我需要这样做),而不是突出显示代码段并单独运行时,为什么我会得到相互重叠的重复boxplots?在
回应@ImportanceOfBeingErnest的评论,你的代码太复杂了,你的问题也不够清楚。你想创建3个不同的图形,每种情况一个(单数、二元和三元)?您是否尝试使用一个具有3个轴的图形(matplotlib中称为子图)?你想把三个箱子并排放在一张图上吗?在
对我来说,最简单的方法是创建一个包含3个子图的图形,如下所示:
请参阅subplots demo here和有关^{} 或^{} 的文档。在the documentation for ^{} 中,您将看到它是一个“轴级别”函数,这意味着您可以要求它在您选择的任何轴对象上绘图
相关问题 更多 >
编程相关推荐