<p><em>Someone suggested the following alternative answer to me:</em></p>
<hr/>
<p>Using <code>collect_list</code> and <code>explode</code>:</p>
<pre><code>from pyspark.sql import Window as W, functions as F

df1 = spark.createDataFrame([('1234', 'banana', 'Paris'),
                             ('1235', 'orange', 'Berlin'),
                             ('1236', 'orange', 'Paris'),
                             ('1237', 'banana', 'Berlin'),
                             ('1238', 'orange', 'Paris'),
                             ('1239', 'banana', 'Berlin'),
                             ], ["A", "B", "C"])

# Collect every A per (B, C) group, give each group a random id in {0, 1, 2},
# then explode back to one row per original record
df = (df1.groupBy("B", "C").agg(F.collect_list("A").alias("A"))
         .withColumn("id", F.rand())
         .withColumn("id", F.row_number().over(W.partitionBy().orderBy("id")) % 3)
         .withColumn("A", F.explode("A")))
df.show()
+------+------+----+---+
|     B|     C|   A| id|
+------+------+----+---+
|banana|Berlin|1237|  1|
|banana|Berlin|1239|  1|
|orange|Berlin|1235|  2|
|orange| Paris|1236|  0|
|orange| Paris|1238|  0|
|banana| Paris|1234|  1|
+------+------+----+---+
</code></pre>
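<p>To see what the pipeline computes without needing a Spark session, here is a minimal plain-Python sketch of the same idea: a dictionary stands in for <code>groupBy</code> + <code>collect_list</code>, a shuffled rank modulo 3 stands in for <code>row_number().over(...orderBy(rand())) % 3</code>, and a nested comprehension plays the role of <code>explode</code>. This is an illustrative analogy, not the Spark implementation itself.</p>

```python
import random
from collections import defaultdict

rows = [('1234', 'banana', 'Paris'),
        ('1235', 'orange', 'Berlin'),
        ('1236', 'orange', 'Paris'),
        ('1237', 'banana', 'Berlin'),
        ('1238', 'orange', 'Paris'),
        ('1239', 'banana', 'Berlin')]

# Step 1: groupBy("B", "C") + collect_list("A")
groups = defaultdict(list)
for a, b, c in rows:
    groups[(b, c)].append(a)

# Step 2: rank the groups in a random order and take rank % 3,
# mirroring row_number().over(W.partitionBy().orderBy(rand())) % 3
keys = list(groups)
random.shuffle(keys)
ids = {key: rank % 3 for rank, key in enumerate(keys, start=1)}

# Step 3: explode back to one (B, C, A, id) row per original record
result = [(b, c, a, ids[(b, c)])
          for (b, c), a_list in groups.items()
          for a in a_list]
```

<p>The key property, as in the Spark version, is that every row sharing the same <code>(B, C)</code> pair ends up with the same <code>id</code>, while the assignment of groups to ids 0-2 is random.</p>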
<hr/>
<p>The result is essentially the same as the answer provided by PySpark Helper.</p>