Unexpected vector size from OneHotEncoder in PySpark

Posted 2024-09-24 06:35:46


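I am trying to check the output of OneHotEncoder in pyspark. I read on forums and in the encoder documentation that the size of the encoded vector should equal the number of distinct values in the column being encoded.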

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")

model = stringIndexer.fit(df)

indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")

encoded = encoder.transform(indexed)
encoded.show()

Here is the result of the code above:

[output of encoded.show() not preserved in the source]

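According to the categoryVec column, the size of the vector is 2. However, the number of distinct values in the "category" column is 3, namely a, b, and c. Please help me understand what I am missing here.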


1 Answer
Answer #1 · posted 2024-09-24 06:35:46

From the OneHotEncoder documentation:

class pyspark.ml.feature.OneHotEncoder(dropLast=True, inputCol=None, outputCol=None)

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

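So for n categories you get an output vector of size n-1, unless you set dropLast to False. There is nothing wrong or strange about this: n-1 positions are enough to map all n categories uniquely, because the last category is represented by the all-zeros vector.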
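To make the rule concrete, here is a minimal pure-Python sketch of the mapping described in the documentation quote. It does not use Spark: `one_hot` is a hypothetical helper written only for illustration, and the index assignment a -> 0, c -> 1, b -> 2 is an assumption (it is what a frequency-ordered StringIndexer would be expected to produce for the data in the question).

```python
def one_hot(index, num_categories, drop_last=True):
    """Illustrative helper (not part of pyspark): dense one-hot vector
    for a category index, following the dropLast rule from the docs."""
    # With drop_last, the vector has one position fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # The last category has no position of its own and stays all zeros.
    if index < size:
        vec[index] = 1.0
    return vec


# Assumed indices for the question's data (most frequent label first).
indices = {"a": 0, "c": 1, "b": 2}

# dropLast=True (the default): vectors of size 2, still one distinct vector per category.
encoded = {label: one_hot(i, 3) for label, i in indices.items()}
for label, vec in encoded.items():
    print(label, vec)

# dropLast=False: vectors of size 3, one position per category.
encoded_full = {label: one_hot(i, 3, drop_last=False) for label, i in indices.items()}
for label, vec in encoded_full.items():
    print(label, vec)
```

In the question's code, the equivalent change would be to construct the encoder as OneHotEncoder(dropLast=False, inputCol="categoryIndex", outputCol="categoryVec"), which should give vectors of size 3.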
