Check whether a value exists within a group of a PySpark DataFrame

Published 2024-10-02 22:23:34


Suppose I have the following df:

df = spark.createDataFrame([
  ("a", "apple"),
  ("a", "pear"),
  ("b", "pear"),
  ("c", "carrot"),
  ("c", "apple"),
], ["id", "fruit"])

+---+-------+
| id|  fruit|
+---+-------+
|  a|  apple|
|  a|   pear|
|  b|   pear|
|  c| carrot|
|  c|  apple| 
+---+-------+

Now I want to create a boolean flag for each id that is True if that id has at least one row with "pear" in the fruit column.

The desired output looks like this:

+---+-------+------+
| id|  fruit|  flag|
+---+-------+------+
|  a|  apple|  True|
|  a|   pear|  True|
|  b|   pear|  True|
|  c| carrot| False|
|  c|  apple| False|
+---+-------+------+

With pandas I found a solution using groupby().transform(), but I don't know how to translate it to PySpark.
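For reference, a minimal sketch of the pandas approach the question alludes to, assuming the same toy data: transform("any") broadcasts a per-group result back to every row of the group.

```python
import pandas as pd

pdf = pd.DataFrame({
    "id":    ["a", "a", "b", "c", "c"],
    "fruit": ["apple", "pear", "pear", "carrot", "apple"],
})

# For each id, ask "does any row in this group equal 'pear'?"
# and broadcast the answer back to every row of the group.
pdf["flag"] = (pdf["fruit"] == "pear").groupby(pdf["id"]).transform("any")
```

The question is how to get the same group-wise broadcast in PySpark.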


1 Answer

Use the max window function:

df.selectExpr("*", "max(fruit = 'pear') over (partition by id) as flag").show()

+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot|false|
|  c| apple|false|
|  b|  pear| true|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+

If you need to check for multiple fruits, you can use the in operator. For example, to check for carrot or apple:

df.selectExpr("*", "max(fruit in ('carrot', 'apple')) over (partition by id) as flag").show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot| true|
|  c| apple| true|
|  b|  pear|false|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+

If you prefer the Python syntax:

from pyspark.sql.window import Window
import pyspark.sql.functions as f

df.select("*", 
  f.max(
    f.col('fruit').isin(['carrot', 'apple'])
  ).over(Window.partitionBy('id')).alias('flag')
).show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot| true|
|  c| apple| true|
|  b|  pear|false|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+
