<p>Use <a href="http://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.functions.explode" rel="nofollow noreferrer"><code>explode</code></a> from <code>pyspark.sql.functions</code> together with <a href="http://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.GroupedData.pivot" rel="nofollow noreferrer"><code>pivot</code></a>:</p>
<pre><code>from pyspark.sql import functions as F
features = [(['a', 'b', 'c'], 1),
            (['a', 'c'], 2),
            (['d'], 3),
            (['b', 'c'], 4),
            (['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()
+---------+---+
|     name| id|
+---------+---+
|[a, b, c]|  1|
|   [a, c]|  2|
|      [d]|  3|
|   [b, c]|  4|
|[a, b, d]|  5|
+---------+---+
df = df.withColumn('exploded', F.explode('name'))
df.drop('name').groupby('id').pivot('exploded').count().show()
+---+----+----+----+----+
| id|   a|   b|   c|   d|
+---+----+----+----+----+
|  5|   1|   1|null|   1|
|  1|   1|   1|   1|null|
|  3|null|null|null|   1|
|  2|   1|null|   1|null|
|  4|null|   1|   1|null|
+---+----+----+----+----+
</code></pre>
<p>To sort by <code>id</code> and convert the <code>null</code> counts to 0, append <code>orderBy('id')</code> and <code>fillna(0)</code> to the chain above.</p>
<p><code>explode</code> returns a new row for each element of the given array or map. <code>pivot</code> can then be used to &ldquo;transpose&rdquo; the exploded column into one column per distinct value.</p>