<p>Actually, we can do this in PySpark 2.2.</p>
<p>First, we create a constant column ("Temp"), group by that column, and apply agg with an unpacked iterable of expressions (*exprs), where each expression is a collect_list over one of the target columns.</p>
<p>Here is the code:</p>
<pre><code>import pyspark.sql.functions as ftions

def groupColumnData(df, columns):
    # Add a constant column so that a single group spans all rows.
    df = df.withColumn("Temp", ftions.lit(1))
    # Build one collect_list expression per target column.
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    # Drop the helper column and restore the original column names.
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
</code></pre>
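<p>For the three columns used in the example below, the *exprs unpacking in agg is equivalent to listing each collect_list expression explicitly:</p>
<pre><code>import pyspark.sql.functions as ftions

# Equivalent explicit form of df.groupby('Temp').agg(*exprs)
# when columns == ["a", "b", "c"]:
df = df.groupby('Temp').agg(
    ftions.collect_list('a'),
    ftions.collect_list('b'),
    ftions.collect_list('c')
)
</code></pre>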
<p>Input data:</p>
<pre><code>df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
</code></pre>
<p>Output data:</p>
<pre><code>result = groupColumnData(df, ["a", "b", "c"])
result.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
</code></pre>
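<p>For completeness, here is a minimal sketch of how the input DataFrame above can be constructed; the local SparkSession setup is an assumption for demonstration and is not part of the original answer:</p>
<pre><code>from pyspark.sql import SparkSession

# Assumed local session for demonstration purposes.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Recreates the input data shown above.
df = spark.createDataFrame(
    [(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)],
    ["a", "b", "c"],
)
</code></pre>
<p>Note that collect_list does not guarantee element order; with a small single-partition input like this one, the lists typically come out in the input order shown above.</p>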