What data types does VectorAssembler accept as input?

from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()

1 answer

User

#1 · Posted 2024-09-24 06:29:58

According to the docs:

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

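Column a is an array column, which is not one of those types, so you first need to convert it into a vector column (method taken from this question):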

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
# Wrap each Python list in a dense vector so the column gets the VectorUDT type
list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
df_with_vectors = df.withColumn('a', list_to_vector_udf('a'))

Then you can use the VectorAssembler:

vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])

vecAssembler.transform(df_with_vectors).show(truncate=False)
+-------------+---+---+---------------------+
|a            |b  |c  |features             |
+-------------+---+---+---------------------+
|[1.0,2.0,3.0]|0  |3  |[1.0,2.0,3.0,0.0,3.0]|
+-------------+---+---+---------------------+
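
To make the type rule above concrete, here is a minimal plain-Python sketch of the behaviour described in the docs quote: numeric and boolean values contribute one element each, vectors are appended element by element, and anything else (such as a plain list standing in for an array column) is rejected. This is not Spark's implementation. The DenseVector class and the assemble function are hypothetical stand-ins written only for illustration, and the error message is made up.

```python
from numbers import Real


class DenseVector:
    """Hypothetical stand-in for a Spark ML dense vector (illustration only)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]


def assemble(row, input_cols):
    """Concatenate numeric, boolean, and vector columns into one flat list of floats."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, DenseVector):
            # Vector column: append every element in order
            features.extend(value.values)
        elif isinstance(value, Real):
            # Numeric or boolean column: one element (True becomes 1.0)
            features.append(float(value))
        else:
            # Anything else, e.g. a plain list modelling an array column, is rejected
            raise TypeError(
                f"column {name!r} has unsupported type {type(value).__name__}"
            )
    return features


# Same row as in the question: "a" is a plain list, so assembling fails
row = {"a": [1, 2, 3], "b": 0, "c": 3}
try:
    assemble(row, ["a", "b", "c"])
except TypeError as exc:
    print(exc)

# After converting "a" to a vector, the values are flattened into one feature list
fixed = dict(row, a=DenseVector(row["a"]))
print(assemble(fixed, ["a", "b", "c"]))
```

The second call yields the same five values as the features column in the output above, which is why converting the array column to a vector is the only change the original code needs.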
