How to encode labels in an array in PySpark

from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local").appName("example")
    .config("spark.some.config.option", "some-value").getOrCreate())
features = [(['a', 'b', 'c'], 1),
            (['a', 'c'], 2),
            (['d'], 3),
            (['b', 'c'], 4),
            (['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()

+--------+--------+--------+--------+----+
| name_a | name_b | name_c | name_d | id |
+--------+--------+--------+--------+----+
| 1      | 1      | 1      | 0      | 1  |
+--------+--------+--------+--------+----+
| 1      | 0      | 1      | 0      | 2  |
+--------+--------+--------+--------+----+
| 0      | 0      | 0      | 1      | 3  |
+--------+--------+--------+--------+----+
| 0      | 1      | 1      | 0      | 4  |
+--------+--------+--------+--------+----+
| 1      | 1      | 0      | 1      | 5  |
+--------+--------+--------+--------+----+

2 answers

User

Answer 1 · edited 2024-10-01 07:48:49

If you want to use the output with Spark ML, it is best to use CountVectorizer:

from pyspark.ml.feature import CountVectorizer

# Add binary=True if needed
df_enc = (CountVectorizer(inputCol="name", outputCol="name_vector")
    .fit(df)
    .transform(df))
df_enc.show(truncate=False)

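The `binary=True` remark in the snippet above only matters when an array repeats a label. As a rough illustration in plain Python (not the Spark API), the sketch below builds count vectors and their binary counterparts. The names `count_vector` and `vocab` are hypothetical, and the alphabetical vocabulary order is an assumption made for readability; CountVectorizer orders its vocabulary by term frequency across the corpus.

```python
from collections import Counter

def count_vector(labels, vocab, binary=False):
    """Return one dense count vector for `labels` over `vocab`.

    Hypothetical helper mimicking the idea behind CountVectorizer;
    it is not part of PySpark.
    """
    counts = Counter(labels)
    vec = [counts.get(term, 0) for term in vocab]
    if binary:
        # Clamp every nonzero count to 1, as binary=True does.
        vec = [1 if c > 0 else 0 for c in vec]
    return vec

rows = [['a', 'b', 'c'], ['a', 'a', 'd']]  # second row repeats 'a'
# Assumption: alphabetical vocabulary, chosen only for readability.
vocab = sorted({label for row in rows for label in row})

counts = [count_vector(row, vocab) for row in rows]
flags = [count_vector(row, vocab, binary=True) for row in rows]
print(vocab)   # ['a', 'b', 'c', 'd']
print(counts)  # [[1, 1, 1, 0], [2, 0, 0, 1]]
print(flags)   # [[1, 1, 1, 0], [1, 0, 0, 1]]
```

With the question's data no array repeats a label, so both settings would give the same 0/1 values there.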

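Otherwise, collect the distinct values:

from pyspark.sql.functions import array_contains, explode

# One row per array element, de-duplicated and sorted, pulled back to the driver
names = [
    x[0] for x in
    df.select(explode("name").alias("name")).distinct().orderBy("name").collect()]

and use array_contains to select the columns: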

# Cast the boolean to an integer first, then name the column
df_sep = df.select("*", *[
    array_contains("name", name).cast("integer").alias("name_{}".format(name))
    for name in names]
)
df_sep.show()

+---------+---+------+------+------+------+
|     name| id|name_a|name_b|name_c|name_d|
+---------+---+------+------+------+------+
|[a, b, c]|  1|     1|     1|     1|     0|
|   [a, c]|  2|     1|     0|     1|     0|
|      [d]|  3|     0|     0|     0|     1|
|   [b, c]|  4|     0|     1|     1|     0|
|[a, b, d]|  5|     1|     1|     0|     1|
+---------+---+------+------+------+------+

User

Answer 2 · edited 2024-10-01 07:48:49

Use explode from pyspark.sql.functions together with pivot:

from pyspark.sql import functions as F
features = [(['a', 'b', 'c'], 1),
             (['a', 'c'], 2),
             (['d'], 3),
             (['b', 'c'], 4),
             (['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()
+---------+---+
|     name| id|
+---------+---+
|[a, b, c]|  1|
|   [a, c]|  2|
|      [d]|  3|
|   [b, c]|  4|
|[a, b, d]|  5|
+---------+---+

df = df.withColumn('exploded', F.explode('name'))

df.drop('name').groupby('id').pivot('exploded').count().show()
+---+----+----+----+----+
| id|   a|   b|   c|   d|
+---+----+----+----+----+
|  5|   1|   1|null|   1|
|  1|   1|   1|   1|null|
|  3|null|null|null|   1|
|  2|   1|null|   1|null|
|  4|null|   1|   1|null|
+---+----+----+----+----+

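To sort by id and convert null to 0, one option is to chain .na.fill(0) and .orderBy('id') onto the pivoted DataFrame before calling .show().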

explode returns a new row for each element in the given array or map. pivot can then be used to "transpose" the new column.
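To make the explode-then-pivot idea concrete without a Spark session, here is a minimal plain-Python sketch of the same steps on the question's data. It only imitates the semantics and is not Spark code; the variable names `exploded`, `counts`, `columns` and `table` are illustrative. Unlike the Spark output, it fills missing cells with 0 directly and emits rows sorted by id.

```python
from collections import defaultdict

features = [(['a', 'b', 'c'], 1), (['a', 'c'], 2), (['d'], 3),
            (['b', 'c'], 4), (['a', 'b', 'd'], 5)]

# "explode": one (id, label) row per array element.
exploded = [(row_id, label) for labels, row_id in features for label in labels]

# "groupby('id').pivot('exploded').count()": count labels per id.
counts = defaultdict(lambda: defaultdict(int))
for row_id, label in exploded:
    counts[row_id][label] += 1

# Pivot columns are the distinct labels; missing cells become 0
# (Spark would leave them null unless they are filled).
columns = sorted({label for _, label in exploded})
table = [(row_id, *[counts[row_id].get(c, 0) for c in columns])
         for row_id in sorted(counts)]

print(('id', *columns))
for row in table:
    print(row)
```

Each tuple in `table` holds the id followed by the counts for a, b, c and d, which matches the layout of the pivoted DataFrame.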
