Wrong vector size from OneHotEncoder in PySpark

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Map each category string to a numeric index
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

# One-hot encode the index column
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

1 answer

User

#1 · Posted 2024-09-24 06:35:46

From the documentation:

class pyspark.ml.feature.OneHotEncoder(dropLast=True, inputCol=None, outputCol=None)
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

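So for n categories you get an output vector of size n-1, unless you set dropLast to False. There is nothing wrong or strange about this: n-1 positions are enough to map every category uniquely, because the last category is represented by the all-zeros vector.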
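To make the mapping concrete, here is a minimal pure-Python sketch of the behaviour the documentation describes. The helper one_hot is hypothetical and written only for illustration; it is not part of pyspark, and it returns a plain list rather than the sparse vector Spark produces.

```python
def one_hot(index, num_categories, drop_last=True):
    """Hypothetical helper mimicking the documented mapping (not a pyspark API)."""
    # With drop_last=True the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    i = int(index)
    # The last category has no slot when drop_last=True, so it stays all zeros.
    if i < size:
        vec[i] = 1.0
    return vec


# Three categories, as in the question: indices 0.0, 1.0 and 2.0.
for idx in (0.0, 1.0, 2.0):
    print(idx, one_hot(idx, 3), one_hot(idx, 3, drop_last=False))
```

With three categories the default gives vectors of length 2, and the category with index 2.0 becomes [0.0, 0.0]. In the original code, passing dropLast=False to OneHotEncoder should give one position per category instead.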
