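I have three lists of words, each belonging to one of three categories: athlete, comedian, and singer. I vectorized these 3 lists with TF-IDF weighting in scikit-learn to get the x_tfidf matrix below (the training data):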
y = ['Athlete', 'Comedian', 'Singer']
x_tfidf = [[0. 0. 0. 0. 0. 0.01707793
0.17077928 0.01707793 0.01707793 0.01707793 0.0129882 0.01707793
0. 0.02597641 0. 0. 0.01707793 0.
0. 0.06831171 0. 0. 0.0129882 0.03415586
0.01707793 0.01707793 0.03415586 0. 0.01707793 0.
0.0129882 0. 0. 0. 0. 0.
0.01707793 0.01707793 0. 0.01707793 0. 0.01707793
0. 0. 0.01707793 0. 0. 0.
0. 0. 0.01707793 0. 0.0302595 0.
0.01707793 0. 0.02597641 0. 0. 0.
0. 0.03415586 0.01707793 0.55475746 0.01707793 0.
0. 0. 0. 0. 0.01707793 0.
0. 0.01707793 0. 0. 0.01707793 0.
0. 0.03415586 0.06831171 0.01707793 0. 0.03415586
0. 0.01707793 0.0129882 0. 0. 0.01707793
0.05195282 0.02597641 0.020173 0.0129882 0.060519 0.02597641
0. 0.01707793 0. 0.55475746 0.55475746 0.01707793
0. 0.0302595 0.01707793 0. 0. 0.
0. 0.01707793 0. 0.03415586 0. 0.
0. 0.02597641 0.03415586 0.01707793 0. 0.05195282
0. 0. 0. 0. 0. 0.
0.03415586 0. 0.02597641 0.01707793 0. 0.
0. 0. 0.0129882 0. 0.03415586 0.
0.05123378]
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0.00791998 0.00791998 0. 0.
0. 0. 0. 0.03167991 0. 0.01583996
0.00602335 0. 0.00791998 0. 0. 0.
0. 0. 0. 0. 0.00791998 0.
0. 0. 0. 0.00602335 0.00791998 0.00602335
0.00602335 0.00791998 0. 0. 0.014033 0.
0. 0.01583996 0. 0. 0. 0.
0.00791998 0. 0. 0.57535302 0. 0.
0. 0. 0. 0.01807004 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.00791998 0.
0. 0. 0. 0.00791998 0. 0.
0. 0. 0.00467767 0. 0.00467767 0.
0.00791998 0. 0. 0.57535302 0.57535302 0.
0. 0.028066 0. 0. 0.01807004 0.01807004
0.03167991 0. 0.03167991 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0.00791998 0. 0.00602335
0. 0.00791998 0. 0. 0.01807004 0.00791998
0. 0. 0. 0.00791998 0. 0.
0. ]
[0.00527285 0.00527285 0.00175762 0.01230331 0.01230331 0.
0. 0. 0. 0. 0.00133671 0.
0.05800134 0.31546417 0.00175762 0.00351523 0. 0.00175762
0.00175762 0. 0. 0. 0.00133671 0.
0. 0. 0. 0. 0. 0.
0. 0.00175762 0. 0.00527285 0.00175762 0.00175762
0. 0. 0.00175762 0. 0. 0.
0.00175762 0.00527285 0. 0.00133671 0. 0.00133671
0.00133671 0. 0. 0.00175762 0.00103808 0.00175762
0. 0. 0.27268937 0.00351523 0.00351523 0.00175762
0. 0. 0. 0.11937881 0. 0.0105457
0.00527285 0.00175762 0.00175762 0.00133671 0. 0.00175762
0.00175762 0. 0.02460663 0.00527285 0. 0.00175762
0.00175762 0. 0. 0. 0. 0.
0.00175762 0. 0.00401014 0. 0.00175762 0.
0.01737726 0.29675019 0.21591993 0.00133671 0.22214839 0.31412746
0. 0. 0.00175762 0.09654112 0.11937881 0.
0.00351523 0.00207615 0. 0.00527285 0.00133671 0.00133671
0. 0. 0. 0. 0.00351523 0.00175762
0.00175762 0.00133671 0. 0. 0.00527285 0.63360177
0.00175762 0.00703047 0.0105457 0. 0.00351523 0.00935699
0. 0. 0.31412746 0. 0.00133671 0.
0.00175762 0.00175762 0.00133671 0. 0. 0.0105457
0. ]]
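My goal is to test several classifiers and compare the output of the different machine learning algorithms in scikit-learn. That is, given a list of words used as test data, predict whether the user is an athlete, a comedian, or a singer. I tried KNN with the following code: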
from sklearn import neighbors

def classify(x_tfidf, y):
    # Fit a k-nearest-neighbors classifier (n_neighbors defaults to 5)
    knn = neighbors.KNeighborsClassifier()
    knn.fit(x_tfidf, y)
However, I get the following error:
Traceback (most recent call last):
File "bow.py", line 115, in <module>
checkExists()
File "bow.py", line 28, in checkExists
get_tags(table)
File "bow.py", line 34, in get_tags
format_tags(data)
File "bow.py", line 56, in format_tags
vectorize(acc_list)
File "bow.py", line 86, in vectorize
classify(x_tag_tfidf, y)
File "bow.py", line 95, in classify
knn.fit(x_tag_tfidf, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 765, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3]
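I tried converting `y` to a NumPy array and to a NumPy matrix, without success. I would appreciate it if someone could point me in the right direction.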
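I cannot reproduce your error, but I can produce a different one when the number of training samples is smaller than the number of neighbors the classifier will use (n_neighbors, which defaults to 5).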
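For context, the two numbers at the end of the message in the question are the sample counts scikit-learn inferred for X and y. The sketch below only illustrates how such a mismatch can arise, assuming a hypothetical X that arrives as a single row; the array `x_one_row` is made up for illustration and is not a diagnosis of the data in the question.

```python
import numpy as np
from sklearn import neighbors

# Hypothetical input: one row of features, but three labels.
x_one_row = np.zeros((1, 9))
y = ['Athlete', 'Comedian', 'Singer']

print(x_one_row.shape)  # the first entry is the number of samples seen by fit

message = None
try:
    neighbors.KNeighborsClassifier().fit(x_one_row, y)
except ValueError as exc:
    # scikit-learn rejects X and y whose sample counts differ
    message = str(exc)

print(message)
```

Printing the shape of the matrix right before calling `fit` is a cheap way to check that it has one row per label.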
Consider a randomly generated synthetic dataset with more data points: code like yours runs on it without problems.
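The original snippet for this step is not preserved here, so what follows is a minimal sketch of that kind of experiment. The dataset size (30 samples, 10 features), the random seed, and the added `return knn` are assumptions made for illustration.

```python
import numpy as np
from sklearn import neighbors

# Synthetic stand-in for the TF-IDF matrix: 30 samples, 10 features.
rng = np.random.RandomState(0)
x = rng.rand(30, 10)
y = rng.choice(['Athlete', 'Comedian', 'Singer'], size=30)

def classify(x_tfidf, y):
    knn = neighbors.KNeighborsClassifier()  # n_neighbors defaults to 5
    knn.fit(x_tfidf, y)
    return knn  # returned here so the fitted model can be used afterwards

knn = classify(x, y)
predictions = knn.predict(x[:4])
print(predictions)
```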
Now, if I reduce the toy data to 3 examples, I get an error.
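A sketch of the 3-sample case, again with made-up random data. In the scikit-learn versions I am aware of, `fit` itself succeeds and the error appears once neighbors are queried (for example in `predict`), so both calls sit inside the `try` block; treat the exact point of failure as version-dependent.

```python
import numpy as np
from sklearn import neighbors

# Only 3 training samples, fewer than the default n_neighbors of 5.
rng = np.random.RandomState(0)
x_small = rng.rand(3, 10)
y_small = ['Athlete', 'Comedian', 'Singer']

knn = neighbors.KNeighborsClassifier()

error_message = None
try:
    knn.fit(x_small, y_small)
    knn.predict(x_small)
except ValueError as exc:
    # Raised because 5 neighbors cannot be found among 3 samples
    error_message = str(exc)

print(error_message)
```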
If I manually set the number of neighbors to 3, it works.
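The same 3-sample toy data with `n_neighbors=3` passed explicitly, as a sketch. The data is random and only illustrative, so the predicted labels themselves carry no meaning.

```python
import numpy as np
from sklearn import neighbors

rng = np.random.RandomState(0)
x_small = rng.rand(3, 10)
y_small = ['Athlete', 'Comedian', 'Singer']

# n_neighbors must not exceed the number of training samples.
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(x_small, y_small)

predictions = knn.predict(x_small)
print(predictions)
```

With one training example per class, every prediction is a three-way tie between the classes, so more training data per class is needed before the results are useful.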
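Finally, if you change x in my example from a NumPy ndarray to a list of lists with x.tolist(), everything behaves the same, so the problem has nothing to do with using a list versus an ndarray for y or x.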
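A sketch of that comparison, assuming the same kind of synthetic data as before (the sizes and seed are arbitrary): the classifier is fitted once on the ndarray and once on the equivalent list of lists, and the predictions are compared.

```python
import numpy as np
from sklearn import neighbors

rng = np.random.RandomState(0)
x = rng.rand(30, 10)
y = rng.choice(['Athlete', 'Comedian', 'Singer'], size=30)

# Fit on the ndarray.
knn_array = neighbors.KNeighborsClassifier().fit(x, y)

# Fit on plain Python lists holding the same values.
knn_list = neighbors.KNeighborsClassifier().fit(x.tolist(), y.tolist())

same = np.array_equal(knn_array.predict(x), knn_list.predict(x.tolist()))
print(same)
```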