RidgeClassifier: interpreting the model's answers

Posted 2024-10-02 06:26:58


I have trained this model, and in some cases I cannot explain its answers.

I created a toy training sample:

CourtIDGAS    Addr_upd
03MS0001      usa, new-york, times square, 1
03MS0001      usa, new-york, times square, 3
03MS0001      usa, new-york, times square, 5
03MS0001      usa, new-york, times square, 7
03MS0001      usa, new-york, times square, 9
03MS0001      usa, new-york, times square, 2
03MS0001      usa, new-york, times square, 4
03MS0001      usa, new-york, times square, 6
03MS0001      usa, new-york, times square, 8
03MS0001      usa, new-york, times square, 10
03MS0001      usa, new-york, times square, 12
03MS0002      usa, new-york, times square, 11
03MS0002      usa, new-york, times square, 13
03MS0002      usa, new-york, times square, 14
03MS0002      usa, new-york, times square, 16

I use CountVectorizer to convert the text into vectors and RidgeClassifier to predict the class of an address:

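from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# Tokens are runs of Cyrillic letters, digits, "/" and "-"
vec = CountVectorizer(token_pattern=r'(?u)\b[а-яё0-9\/\-]+\b', min_df=1)
X = vec.fit_transform(df.Addr_upd)
y = df["CourtIDGAS"]  # target labels
clf = RidgeClassifier(random_state=42)
clf.fit(X, y)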
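
A quick way to see what the vectorizer actually learned is to print its vocabulary. The sketch below rebuilds the toy sample from the table above (the `rows` list is only a reconstruction for illustration) and reuses the same `token_pattern`. Note that the character class admits only Cyrillic letters, digits, `/` and `-`, so Latin words such as `usa` are not expected to become features.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Reconstruction of the toy sample: (class label, house number)
rows = [("03MS0001", n) for n in (1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 12)]
rows += [("03MS0002", n) for n in (11, 13, 14, 16)]

df = pd.DataFrame({
    "CourtIDGAS": [label for label, _ in rows],
    "Addr_upd": [f"usa, new-york, times square, {n}" for _, n in rows],
})

vec = CountVectorizer(token_pattern=r'(?u)\b[а-яё0-9\/\-]+\b', min_df=1)
X = vec.fit_transform(df.Addr_upd)

# Every distinct token becomes its own column; list them to see the features
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)
```

If the printed vocabulary contains the house numbers but none of the street words, the model is effectively memorising individual numbers rather than learning anything about the address text.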

When I predict on something from the training sample, I get the correct answer. But when I predict on other data, for example usa, new-york, times square, 18, I get class 03MS0001.

I cannot explain this: the largest number in the vocabulary is 16, so to my mind this example is closer to 03MS0002.

How can I interpret this classifier's answers? And what is the right way to handle data like this?
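
One possible direction, offered only as a sketch and not as the definitive answer: if "closeness" between house numbers is supposed to matter, the number can be passed to the model as a numeric feature instead of a bag-of-words token. The `house_number` helper below is hypothetical and assumes every address ends in an integer; the data is again a reconstruction of the toy table.

```python
import re
import numpy as np
from sklearn.linear_model import RidgeClassifier

def house_number(addr):
    """Hypothetical helper: return the trailing integer of an address string."""
    match = re.search(r"(\d+)\s*$", addr)
    if match is None:
        raise ValueError(f"no trailing house number in {addr!r}")
    return int(match.group(1))

# Reconstruction of the toy sample: (class label, house number)
rows = [("03MS0001", n) for n in (1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 12)]
rows += [("03MS0002", n) for n in (11, 13, 14, 16)]
addresses = [f"usa, new-york, times square, {n}" for _, n in rows]
labels = [label for label, _ in rows]

# A single numeric column, so 18 really is "near" 16 for the model
X_num = np.array([[house_number(a)] for a in addresses], dtype=float)
clf_num = RidgeClassifier().fit(X_num, labels)

new_addr = "usa, new-york, times square, 18"
pred = clf_num.predict([[house_number(new_addr)]])[0]
print(pred)
```

A linear model on one number can only split the range at a single threshold, so this would not capture patterns such as odd versus even sides of a street; those would need extra features.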

