Check whether a value exists within a group of a PySpark DataFrame

Published 2024-10-02 22:23:34


Suppose I have the following df:

df = spark.createDataFrame([
  ("a", "apple"),
  ("a", "pear"),
  ("b", "pear"),
  ("c", "carrot"),
  ("c", "apple"),
], ["id", "fruit"])

+---+-------+
| id|  fruit|
+---+-------+
|  a|  apple|
|  a|   pear|
|  b|   pear|
|  c| carrot|
|  c|  apple| 
+---+-------+

Now I want to create a boolean flag for each id that is True if that id has at least one row with "pear" in the fruit column.

The desired output looks like this:

+---+-------+------+
| id|  fruit|  flag|
+---+-------+------+
|  a|  apple|  True|
|  a|   pear|  True|
|  b|   pear|  True|
|  c| carrot| False|
|  c|  apple| False|
+---+-------+------+

With pandas I found a solution using groupby().transform(), but I don't know how to translate it to PySpark.
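For reference, a minimal sketch of the pandas approach the question alludes to, assuming the same toy data: transform("any") broadcasts a per-group result back to every row of the group.

```python
import pandas as pd

pdf = pd.DataFrame({
    "id":    ["a", "a", "b", "c", "c"],
    "fruit": ["apple", "pear", "pear", "carrot", "apple"],
})

# For each id, ask "does any row in this group equal 'pear'?"
# and broadcast the answer back to every row of the group.
pdf["flag"] = (pdf["fruit"] == "pear").groupby(pdf["id"]).transform("any")
```

The question is how to get the same group-wise broadcast in PySpark.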


1 Answer

Use the max window function:

df.selectExpr("*", "max(fruit = 'pear') over (partition by id) as flag").show()

+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot|false|
|  c| apple|false|
|  b|  pear| true|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+

If you need to check for multiple fruits, you can use the in operator. For example, to check for carrot or apple:

df.selectExpr("*", "max(fruit in ('carrot', 'apple')) over (partition by id) as flag").show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot| true|
|  c| apple| true|
|  b|  pear|false|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+

If you prefer the Python syntax:

from pyspark.sql.window import Window
import pyspark.sql.functions as f

df.select("*", 
  f.max(
    f.col('fruit').isin(['carrot', 'apple'])
  ).over(Window.partitionBy('id')).alias('flag')
).show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot| true|
|  c| apple| true|
|  b|  pear|false|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+
