I am following the pyspark Word2Vec tutorial with some Twitter data to build vectors that I plan to use later in KMeans.
When I run
synonyms = model.findSynonyms('привет', 5)
it raises py4j.protocol.Py4JJavaError.
I have tried:
synonyms = model.findSynonyms(u'привет'.encode('utf-8'), 10)
synonyms = model.findSynonyms(u'привет'.decode('utf-8'), 10)
synonyms = model.findSynonyms(u'\xd0\xbf\xd0\xb8\xd0\xb7\xd0\xb4\xd0\xb5\xd1\x86'.encode('utf-8'), 10)
synonyms = model.findSynonyms(u'\xd0\xbf\xd0\xb8\xd0\xb7\xd0\xb4\xd0\xb5\xd1\x86', 10)
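A note on the attempts above: they mix two different kinds of value. Calling `.encode('utf-8')` on a text string yields a byte string, and a text literal written with `\x` escapes holds one Latin-1 code point per escape rather than the Cyrillic letters those bytes would encode. The following is a minimal sketch of that difference, using the UTF-8 bytes of 'привет' purely for illustration; the variable names are hypothetical and it does not touch Spark at all.

```python
# -*- coding: utf-8 -*-
# Sketch: what the string literals in the attempts above actually contain.

word = u'привет'                      # 6 Cyrillic code points
utf8_bytes = word.encode('utf-8')     # 12 bytes: a byte string, not text

# A text literal written with \x escapes holds 12 Latin-1 code points
# (U+00D0, U+00BF, ...), not the 6 Cyrillic letters those bytes would encode.
escaped = u'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

# Reinterpreting those code points as bytes and decoding them as UTF-8
# recovers the intended word.
recovered = escaped.encode('latin-1').decode('utf-8')

print(len(word), len(utf8_bytes), len(escaped))
print(recovered == word)
```

If the model was trained on text decoded as UTF-8, only the plain text form (`word` or `recovered` here) can match a vocabulary entry; the byte string and the escaped form are different values.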
from pyspark.mllib.feature import Word2Vec

# `sc` is the SparkContext that the pyspark shell provides
inp = sc.textFile("data/mllib/sample_lda_data.txt").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
synonyms = model.findSynonyms('1', 5)
for word, cosine_distance in synonyms:
    print("{}: {}".format(word, cosine_distance))
Expected output:
>>> for word, cosine_distance in synonyms:
...     print("{}: {}".format(word.encode('utf-8'), cosine_distance))
...
look: 0.91164034605
phone: 0.910009503365
Been: 0.90544962883
number.: 0.904221653938
Look: 0.903845191002
But I cannot get that far, because findSynonyms() does not work with Cyrillic words:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/mllib/feature.py", line 611, in findSynonyms
    words, similarity = self.call("findSynonyms", word, num)
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/mllib/common.py", line 146, in call
    return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/mllib/common.py", line 123, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: <exception str() failed>
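The traceback hides the Java-side message (`<exception str() failed>`), so the cause cannot be read from it. One possible explanation, offered here only as an assumption, is that the queried word is simply not in the model's vocabulary: Word2Vec discards words that occur fewer than `minCount` times (5 by default), and a lookup of an unknown word fails on the Java side. A minimal sketch of a membership check follows; `lookup_word` and the sample `vocabulary` dict are hypothetical names introduced for illustration, and the dict stands in for whatever word-to-vector map the fitted model exposes.

```python
# -*- coding: utf-8 -*-
# Sketch: check that a word is in the model vocabulary before querying it.

def lookup_word(word, vocabulary):
    """Return `word` as text if it is a key of `vocabulary`, else None."""
    if isinstance(word, bytes):
        # Byte strings are decoded so the lookup compares text with text.
        word = word.decode('utf-8')
    return word if word in vocabulary else None

# Hypothetical stand-in for the word-to-vector map of a fitted model.
vocabulary = {u'look': [0.1, 0.2], u'phone': [0.3, 0.4]}

for candidate in (u'look', u'привет', u'phone'.encode('utf-8')):
    found = lookup_word(candidate, vocabulary)
    print(found if found is not None else 'not in vocabulary')
```

With a fitted model, `model.getVectors()` could be passed as `vocabulary`, assuming it returns a dict-like map keyed by word. Catching `py4j.protocol.Py4JJavaError` and inspecting its `java_exception` attribute may also reveal the underlying Java message that failed to print here.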