How to encode labels from an array in PySpark


For example, I have a DataFrame with categorical features in the <code>name</code> column:
<pre><code>from pyspark.sql import SparkSession

# Parentheses let the builder chain span several lines.
spark = (
    SparkSession.builder.master("local")
    .appName("example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

# Each row holds an array of string labels and an integer id.
features = [(['a', 'b', 'c'], 1),
            (['a', 'c'], 2),
            (['d'], 3),
            (['b', 'c'], 4),
            (['a', 'b', 'd'], 5)]

df = spark.createDataFrame(features, ['name', 'id'])
df.show()
</code></pre>
Output: ^{pr2}$

What I want is:
<pre><code>+--------+--------+--------+--------+----+
| name_a | name_b | name_c | name_d | id |
+--------+--------+--------+--------+----+
| 1      | 1      | 1      | 0      | 1  |
+--------+--------+--------+--------+----+
| 1      | 0      | 1      | 0      | 2  |
+--------+--------+--------+--------+----+
| 0      | 0      | 0      | 1      | 3  |
+--------+--------+--------+--------+----+
| 0      | 1      | 1      | 0      | 4  |
+--------+--------+--------+--------+----+
| 1      | 1      | 0      | 1      | 5  |
+--------+--------+--------+--------+----+
</code></pre>
I found the <a href="https://stackoverflow.com/questions/53347183/encode-array-of-strings-into-columns-in-pyspark">same question</a>, but nothing there helped. I tried to use <code>VectorIndexer</code> from <code>PySpark.ML</code>, but ran into problems converting the <code>name</code> field to <code>vector type</code>:
<pre><code>from pyspark.ml.feature import VectorIndexer

indexer = VectorIndexer(inputCol="name", outputCol="indexed", maxCategories=5)
indexerModel = indexer.fit(df)
</code></pre>
I get the following error:
<pre><code>Column name must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType
</code></pre>
I found a solution <a href="https://stackoverflow.com/questions/42138482/pyspark-how-do-i-convert-an-array-i-e-list-column-to-vector">here</a>, but it looks overly complicated. However, I am not sure whether this can be done with <code>VectorIndexer</code> alone.
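One possible way to get the table above without going through <code>VectorIndexer</code> is to build one 0/1 column per distinct label with <code>explode</code> and <code>array_contains</code> from <code>pyspark.sql.functions</code>. The sketch below is only an illustration under stated assumptions: the helper names <code>multi_hot</code> and <code>multi_hot_columns</code> are hypothetical (not part of PySpark), and the set of distinct labels is assumed small enough to collect to the driver.

```python
def multi_hot(labels, vocabulary):
    """Plain-Python reference for the per-row rule (hypothetical helper).

    Returns 1 for each vocabulary entry present in labels, else 0.
    """
    present = set(labels)
    return [int(v in present) for v in vocabulary]


def multi_hot_columns(df, array_col="name", prefix="name_"):
    """Hypothetical helper: replace array_col with one 0/1 column per label."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.sql import functions as F

    # Collect the distinct labels to the driver.
    # Assumption: the label set is small enough for collect().
    labels = sorted(
        row[0] for row in df.select(F.explode(array_col)).distinct().collect()
    )

    # One indicator column per label: the boolean result of array_contains
    # is cast to int, giving 1 when the array holds the label and 0 otherwise.
    indicators = [
        F.array_contains(F.col(array_col), label).cast("int").alias(prefix + label)
        for label in labels
    ]

    # Keep every other column (here: id) after the indicator columns.
    other_cols = [c for c in df.columns if c != array_col]
    return df.select(*indicators, *other_cols)
```

With the DataFrame from the question, <code>multi_hot_columns(df).show()</code> should give the columns <code>name_a</code> through <code>name_d</code> followed by <code>id</code>. If a single vector column is acceptable instead of separate columns, <code>CountVectorizer</code> from <code>pyspark.ml.feature</code> with <code>binary=True</code> takes an array-of-strings column as input directly, so the array-to-vector conversion that <code>VectorIndexer</code> requires is not needed.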
