from pyspark.ml.linalg import Vectors, VectorUDT
import pyspark.sql.functions as F
def max_binarizer(vector):
    max_val = max(vector)  # maximum value in the vector
    return Vectors.dense([1 if x == max_val else 0 for x in vector])  # binarize it
# create a udf for the binarizer
max_bin_udf = F.udf(max_binarizer, VectorUDT())
df.withColumn("vector", max_bin_udf(df["vector"])).show()
+------+-------------+
|  Col1|       vector|
+------+-------------+
|Modali|[0.0,0.0,1.0]|
|assert|[0.0,1.0,0.0]|
+------+-------------+
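You can create a udf that takes a vector and binarizes it. The binarizer can be built with a simple list comprehension that checks whether each value in the vector equals the maximum value.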
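To see the list-comprehension logic on its own, here is a minimal sketch in plain Python, without Spark. The function name `max_binarize` and the sample inputs are hypothetical and only for illustration; the sketch assumes the input is any iterable of numbers.

```python
def max_binarize(values):
    """Return a list with 1.0 where a value equals the maximum, else 0.0."""
    values = list(values)
    if not values:
        # max() raises ValueError on an empty sequence, so return early.
        return []
    max_val = max(values)
    # Compare every element with the maximum, as the udf does.
    return [1.0 if x == max_val else 0.0 for x in values]


print(max_binarize([0.1, 0.3, 0.6]))  # [0.0, 0.0, 1.0]
print(max_binarize([0.5, 0.5, 0.2]))  # [1.0, 1.0, 0.0]
```

Note that when the maximum occurs more than once, every tied position is set to 1.0, so the result is not guaranteed to be one-hot.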