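I'm new to machine learning and am struggling to get a classifier to make predictions on the test dataset.
At first I thought the dimension-mismatch error came from the vectorizer being re-fitted on the test set, but I've fixed that and the problem is still there.
From what I've looked into, I believe the error is caused by the vectorizer being overwritten somewhere, but I can't find where...
If I've gone about this the long way round, I'd appreciate any pointers :)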
import sqlalchemy
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn import metrics
import pickle
### Connect to MYSQL database
##
#
dbServerName = "localhost"
dbUser = "root"
dbPassword = "woodycool123"
dbName = "azure_support_tweets"
engine = sqlalchemy.create_engine(f"mysql+pymysql://{dbUser}:{dbPassword}@{dbServerName}:3306/{dbName}")
# None means "no limit"; -1 is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)
df = pd.read_sql_table("preprocessed_tweets", engine)
data = pd.DataFrame(df)
### Training and Test Data Split
##
#
features_train, features_test, labels_train, labels_test = train_test_split(data['text_tweet'], data['main_category'], random_state = 42, test_size=0.34)
### CountVectorizer
##
#
cv = CountVectorizer(ngram_range=(1,2), stop_words='english', min_df=3, max_df=0.50)
features_train_cv = cv.fit_transform(features_train)
# Uncomment to print a matrix count of tokens
# print(features_train_cv.toarray())
print("Feature Count\nCountVectorizer() #", len(cv.vocabulary_))
### TF-IDF Transformer
##
#
tfidfv = TfidfTransformer(use_idf=True)
features_train_tfidfv = tfidfv.fit_transform(features_train_cv)
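print("Feature Set\nTfidfTransformer() #", features_train_tfidfv.shape)
# Uncomment to print the 10 features with the highest IDF
# (feature names live on the CountVectorizer, not the TfidfTransformer;
# on scikit-learn < 1.0 use cv.get_feature_names() instead)
# features = cv.get_feature_names_out()
# feature_order = np.argsort(tfidfv.idf_)[::-1]
# top_n = 10
# top_n_features = [features[i] for i in feature_order[:top_n]]
# print(top_n_features)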
### SelectKBest
##
#
selector = SelectKBest(chi2, k=1000).fit_transform(features_train_tfidfv, labels_train)
print("Feature Set\nSelectKBest() and chi2 #", selector.shape)
### Train Model
##
#
clf = MultinomialNB()
clf.fit(selector, labels_train)
### Test Model
##
#
features_test_cv = cv.transform(features_test)
features_test_cv_two = tfidfv.transform(features_test_cv)
pred = clf.predict(features_test_cv)
Error: clf.predict fails with a dimension mismatch (the full traceback is not included).
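It looks like you forgot to apply the dimensionality reduction step, i.e. SelectKBest, in the Test Model section. I'm not sure that using SelectKBest this way is correct if you also want to transform the test data. Either way, the naive Bayes model expects input with the same shape as selector, which in your case means k=1000 columns. You skipped this transformation in the test part, so clf.predict receives a matrix of a different shape. Try using SelectKBest.transform on the test data to get the desired output.

You need to pass the test set through the selector as well, but the selector has to be fitted first. The error is thrown because the selector reduced the dimensionality of the training set but not of the test set.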
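
The fix described in the answers can be sketched as follows. This is a minimal, self-contained illustration rather than the original poster's script: the toy texts, the labels and k=5 are made-up stand-ins for the tweet table and k=1000. The key change is that selector holds the fitted SelectKBest object instead of the matrix returned by fit_transform, so the same selection can be reused on the test set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the preprocessed tweets.
train_texts = [
    "vm will not start after resize",
    "cannot start my vm today",
    "vm stuck on restart",
    "storage blob upload fails",
    "blob storage returns error",
    "storage account upload error",
]
train_labels = ["compute", "compute", "compute",
                "storage", "storage", "storage"]
test_texts = ["my vm will not restart", "upload to blob storage fails"]

cv = CountVectorizer()
tfidf = TfidfTransformer(use_idf=True)
# Keep the selector object itself; k=5 is an assumption for the toy corpus.
selector = SelectKBest(chi2, k=5)

# Fit every step on the training data only.
X_train = cv.fit_transform(train_texts)
X_train = tfidf.fit_transform(X_train)
X_train = selector.fit_transform(X_train, train_labels)

clf = MultinomialNB().fit(X_train, train_labels)

# Reuse the already fitted steps on the test data: transform, never fit.
X_test = cv.transform(test_texts)
X_test = tfidf.transform(X_test)
X_test = selector.transform(X_test)

pred = clf.predict(X_test)
print(X_train.shape, X_test.shape)  # both matrices now have k columns
```

Applied to the code in the question, this would mean assigning SelectKBest(chi2, k=1000) to a variable, calling fit_transform on the training features, and passing features_test_cv_two through that selector's transform before clf.predict. Wrapping the three steps and the classifier in sklearn.pipeline.Pipeline is another option that makes it harder to skip a step at prediction time.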