I trained a model, and in some cases I cannot explain its predictions.
I created a toy training sample:
CourtIDGAS Addr_upd
03MS0001 usa, new-york, times square, 1
03MS0001 usa, new-york, times square, 3
03MS0001 usa, new-york, times square, 5
03MS0001 usa, new-york, times square, 7
03MS0001 usa, new-york, times square, 9
03MS0001 usa, new-york, times square, 2
03MS0001 usa, new-york, times square, 4
03MS0001 usa, new-york, times square, 6
03MS0001 usa, new-york, times square, 8
03MS0001 usa, new-york, times square, 10
03MS0001 usa, new-york, times square, 12
03MS0002 usa, new-york, times square, 11
03MS0002 usa, new-york, times square, 13
03MS0002 usa, new-york, times square, 14
03MS0002 usa, new-york, times square, 16
I use CountVectorizer to convert the text into vectors and RidgeClassifier to predict the class of each address:
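from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# df holds the toy sample above, with columns CourtIDGAS and Addr_upd.
# The character class matches Cyrillic letters, digits, '/' and '-'; Latin letters are not matched.
vec = CountVectorizer(token_pattern=r'(?u)\b[а-яё0-9\/\-]+\b', min_df=1)
X = vec.fit_transform(df.Addr_upd)
y = df["CourtIDGAS"]
clf = RidgeClassifier(random_state=42)
clf.fit(X, y)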
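When I predict on examples from the training sample, I get the correct answers.
But when I predict on other data, for example usa, new-york, times square, 18,
I get class 03MS0001.
I cannot explain this: the largest house number in the vocabulary is 16, so to my mind this example is closer to 03MS0002.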
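One way to investigate this is sketched below, assuming the toy sample above is the whole training set. CountVectorizer builds a bag-of-words representation, so each house number is an independent token with no numeric ordering, and a token that was not seen during fit is dropped at transform time. The DataFrame construction and the names new_address, tokens and unseen are illustrative assumptions, not part of the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# Rebuild the toy sample from the question.
numbers_1 = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 12]
numbers_2 = [11, 13, 14, 16]
df = pd.DataFrame({
    "CourtIDGAS": ["03MS0001"] * len(numbers_1) + ["03MS0002"] * len(numbers_2),
    "Addr_upd": [f"usa, new-york, times square, {n}" for n in numbers_1 + numbers_2],
})

# Same token pattern as in the question, written as a raw string.
vec = CountVectorizer(token_pattern=r"(?u)\b[а-яё0-9\/\-]+\b", min_df=1)
X = vec.fit_transform(df["Addr_upd"])
clf = RidgeClassifier(random_state=42).fit(X, df["CourtIDGAS"])

new_address = "usa, new-york, times square, 18"
tokens = vec.build_analyzer()(new_address)  # tokens the vectorizer extracts
unseen = [t for t in tokens if t not in vec.vocabulary_]
X_new = vec.transform([new_address])  # unseen tokens are silently ignored

print("vocabulary:", sorted(vec.vocabulary_))
print("tokens of new address:", tokens)
print("tokens missing from vocabulary:", unseen)
print("non-zero features after transform:", X_new.nnz)
print("decision score:", clf.decision_function(X_new))
print("prediction:", clf.predict(X_new))
```

If 18 is reported as missing from the vocabulary, the transformed vector carries no house-number information at all. The classifier can then rely only on its intercept and on tokens shared by every address, which would be consistent with falling back to the larger class, 03MS0001.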
How should I interpret this classifier's answer? And how should I process data like this correctly?
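If numeric closeness of the house number is what should drive the prediction, one possible approach is to extract the number as a numeric feature and use a distance-based classifier. This is only a sketch and not the setup from the question: KNeighborsClassifier is a swapped-in technique, and house_number is a hypothetical helper that assumes the number is the last field of the address.

```python
import re

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def house_number(address: str) -> int:
    """Hypothetical helper: return the trailing integer of an address string."""
    match = re.search(r"(\d+)\s*$", address)
    if match is None:
        raise ValueError(f"no house number found in {address!r}")
    return int(match.group(1))


# Rebuild the toy sample from the question.
numbers_1 = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 12]
numbers_2 = [11, 13, 14, 16]
addresses = [f"usa, new-york, times square, {n}" for n in numbers_1 + numbers_2]
labels = ["03MS0001"] * len(numbers_1) + ["03MS0002"] * len(numbers_2)

# One numeric feature per address, so that 18 is "near" 16 rather than unknown.
X_num = np.array([[house_number(a)] for a in addresses])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_num, labels)

new_address = "usa, new-york, times square, 18"
pred = knn.predict([[house_number(new_address)]])
print(pred)
```

On real addresses the street and city would still need their own features (for example, one model per street, or the text features combined with the numeric one), so this only illustrates the numeric part of the idea.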