CountVectorizer: using the same vocabulary a second time

+-------+-------------------+-------------------+
|classes|tags               |vectors            |
+-------+-------------------+-------------------+
|0.0    |[happy, food, food]|(3,[0,1],[2.0,1.0])|
|0.0    |[dog, food, food]  |(3,[0,2],[2.0,1.0])|
|1.0    |[food, happy, food]|(3,[0,1],[2.0,1.0])|
|1.0    |[food, dog, food]  |(3,[0,2],[2.0,1.0])|
|6.0    |[food, food, happy]|(3,[0,1],[2.0,1.0])|
|6.0    |[food, food, dog]  |(3,[0,2],[2.0,1.0])|
+-------+-------------------+-------------------+

['dog', 'food', 'happy']

1 answer

User

#1 · Posted on 2024-10-02 12:28:33

As of Spark 2.0, this is available in PySpark, and it works just like persisting and loading any other spark-ml model.

OK, let's create a model first:

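from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

# Reuse the active session (already defined as `spark` in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

# Input data: each row is a bag of words with an ID.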
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

result = model.transform(df)
result.show(truncate=False)
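# +---+---------------+-------------------------+
# |id |words          |features                 |
# +---+---------------+-------------------------+
# |0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
# |1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+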
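
To read the `features` column above: each value is a sparse vector printed as `(size, [indices], [values])`, where the indices are positions in the model's vocabulary. The following is a plain-Python sketch of that mapping, not Spark's implementation. The helper name `count_vector` and the vocabulary order `["a", "b", "c"]` are assumptions made for illustration; the order a fitted model actually uses is available as `model.vocabulary`.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Count tokens against a fixed vocabulary.

    Returns a (size, indices, values) triple that mirrors how a
    sparse vector is printed. Tokens outside the vocabulary are ignored.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return (len(vocabulary), indices, values)


# Assumed vocabulary order, for illustration only.
vocabulary = ["a", "b", "c"]

row0 = count_vector(["a", "b", "c"], vocabulary)
row1 = count_vector(["a", "b", "b", "c", "a"], vocabulary)
print(row0)  # (3, [0, 1, 2], [1.0, 1.0, 1.0])
print(row1)  # (3, [0, 1, 2], [2.0, 2.0, 1.0])
```

Because the vocabulary is fixed, the same token always lands at the same index, which is why reusing one fitted model keeps vectors comparable across datasets.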

Then persist it:

model.save("/tmp/count_vec_model")

Now you can load it and use it:

same_model = CountVectorizerModel.load("/tmp/count_vec_model")
same_model.transform(df).show(truncate=False)
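# +---+---------------+-------------------------+
# |id |words          |features                 |
# +---+---------------+-------------------------+
# |0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
# |1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+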

For details, see the documentation on saving and loading spark-ml models/pipelines.

Example code for creating the model can be found in the official documentation.

