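<p>You can use a window function: partition the rows by the unique key, order them by <code>status</code>, and keep only the top-ranked row for each key.</p>
<p>PS: I wrote the code in Scala.</p>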
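<pre><code>// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Sample data: several status values per unique-id
val df = Seq(
  (1, "OAOS-STP"),
  (1, "OAOS-nonSTP"),
  (1, "manual"),
  (2, "OAOS-nonSTP"),
  (2, "manual"),
  (3, "OAOS-STP"),
  (3, "OAOS-nonSTP"),
  (4, "OAOS-STP"),
  (4, "manual")
).toDF("unique-id", "status")

// Lower-cased helper column; note the window below orders by the original
// "status" column, so this helper is only dropped again at the end
val df2 = df.withColumn("lower_status", lower($"status"))

// Per unique-id, sort by status ascending (case-sensitive):
// "OAOS-STP" sorts before "OAOS-nonSTP", which sorts before "manual"
val windowSpec = Window.partitionBy("unique-id").orderBy("status")

// Keep the first-ranked row of each partition, then drop the helper columns
val df3 = df2
  .withColumn("rank", rank().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank")
  .drop("lower_status")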
</code></pre>
<p>The output of <code>df3.show(false)</code> is:</p>
<p><a href="https://i.stack.imgur.com/FD02p.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/FD02p.png" alt="output"/></a></p>
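<p>The snippet above depends on how the status strings happen to sort (uppercase letters sort before lowercase ones). If you would rather not depend on that, here is a minimal sketch that ranks by an explicit priority instead. The priority numbers, the fallback value 99, and the names <code>priority</code>, <code>byPriority</code> and <code>best</code> are my own assumptions for illustration; adjust them to your actual rules. Like the code above, it assumes a SparkSession named <code>spark</code> is in scope.</p>
<pre><code>import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  (1, "OAOS-STP"), (1, "OAOS-nonSTP"), (1, "manual"),
  (2, "OAOS-nonSTP"), (2, "manual")
).toDF("unique-id", "status")

// Assumed priority (lower number wins); unrecognised statuses sort last
val priority = when($"status" === "OAOS-STP", 1)
  .when($"status" === "OAOS-nonSTP", 2)
  .when($"status" === "manual", 3)
  .otherwise(99)

// Order each unique-id partition by the explicit priority expression
val byPriority = Window.partitionBy("unique-id").orderBy(priority)

// row_number() keeps exactly one row per key, even when two rows tie
val best = df
  .withColumn("rn", row_number().over(byPriority))
  .filter($"rn" === 1)
  .drop("rn")

best.show(false)
</code></pre>
<p>Using <code>row_number()</code> rather than <code>rank()</code> here is a design choice: <code>rank()</code> gives tied rows the same rank, so a key with duplicate statuses could return more than one row.</p>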