Applying a logical operator across a list of conditions in PySpark

# List of conditions
spark_conditions = [cond1, cond2, ..., cond100]
# Somehow apply the '|' operator across `spark_conditions`
# spark_conditions would then look like -> [cond1 | cond2 | .... | cond100]
df.select(columns).where(spark_conditions)

2 answers

User

#1 · edited 2024-09-30 06:13:39

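2e0byo's answer is quite right. I am adding another way to achieve this in PySpark.

If the conditions are strings containing SQL conditional expressions (such as col_1=='ABC101'), we can join all of those strings together and pass the combined string as the condition to where() (or filter()).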

df = spark.createDataFrame([(1, "a"),
                            (2, "b"),
                            (3, "c"),
                            (4, "d"),
                            (5, "e"),
                            (6, "f"),
                            (7, "g")], schema="id int, name string")
condition1 = "id == 1"
condition2 = "id == 4"
condition3 = "id == 6"
conditions = [condition1, condition2, condition3]
combined_or_condition = " or ".join(conditions)     # Combine the conditions: condition1 or condition2 or condition3
df.where(combined_or_condition).show()

" or ".join(conditions) builds a single string by joining every string in conditions, with " or " as the separator. Here, combined_or_condition becomes id == 1 or id == 4 or id == 6.
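One caveat with plain string joining is operator precedence: if an individual condition itself contains `or`, joining with " and " changes its meaning. A minimal sketch of a safer join follows, assuming a hypothetical helper named `combine_conditions` (not part of PySpark) that wraps each condition in parentheses first.

```python
def combine_conditions(conditions, op="or"):
    """Join SQL condition strings with `op`, parenthesizing each one.

    Hypothetical helper for illustration; it only builds a string and
    does not call any PySpark API.
    """
    if op not in ("or", "and"):
        raise ValueError(f"unsupported operator: {op!r}")
    if not conditions:
        raise ValueError("at least one condition is required")
    # Parentheses keep each condition intact regardless of precedence.
    return f" {op} ".join(f"({c})" for c in conditions)


conditions = ["id == 1", "id == 4", "id > 5 and name == 'f'"]
combined = combine_conditions(conditions)
print(combined)
# (id == 1) or (id == 4) or (id > 5 and name == 'f')
```

The resulting string can be passed to df.where(combined) in the same way as combined_or_condition.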

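User

#2 · edited 2024-09-30 06:13:39

I think this is really a pandas question, since spark.sql.DataFrame seems to behave at least somewhat like a pandas DataFrame. I don't know Spark, though. In any case, your "spark conditions" are (I think) actually boolean Series. I am sure there is some proper way to combine boolean Series in pandas, but you can also just reduce them: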

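import pandas as pd
from functools import reduce

df = pd.DataFrame([0, 1, 2, 2, 1, 4], columns=["num"])
filter1 = df["num"] > 3
filter2 = df["num"] == 2
filter3 = df["num"] == 1
filters = (filter1, filter2, filter3)
combined = reduce(lambda x, y: x | y, filters)  # named `combined` to avoid shadowing the built-in filter()
df[combined]  # pandas selects rows with a boolean mask
# note: in PySpark, with Column conditions, use df.filter(combined); .where is an alias for .filter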

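Here is how it works: reduce() takes the first two items in filters and runs lambda x, y: x | y on them. It then takes that output and passes it to lambda x, y: x | y as x, with the third entry of filters as y. It keeps going until there are no items left to take.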
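To make the order of evaluation visible, here is a small sketch that runs reduce() over placeholder names instead of real conditions. The strings "filter1", "filter2" and "filter3" are stand-ins chosen for illustration, and the lambda builds a text description of each step rather than applying `|`.

```python
from functools import reduce

# Stand-in names; each step wraps the running result and the next item
# in parentheses so the grouping that reduce() produces is visible.
names = ["filter1", "filter2", "filter3"]
grouping = reduce(lambda x, y: f"({x} | {y})", names)
print(grouping)
# ((filter1 | filter2) | filter3)
```

The first call receives the first two items, and every later call receives the accumulated result as x and the next item as y.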

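So the net effect is to apply a function cumulatively along an iterable. In this case the function just returns the | of its inputs, so it does exactly what you would otherwise write by hand, like this: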

(filter1 | filter2) | filter3

I suspect there is a simpler way to do this, but reduce is sometimes worth it. Guido doesn't like it, though.
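One small simplification is to replace the lambda with `operator.or_` from the standard library, which applies `|` to its two arguments. The sketch below assumes two hypothetical helpers, `any_of` and `filter_any`, and assumes that `df` is a pyspark.sql.DataFrame and `conditions` a list of Column expressions (which support `|`). The Spark part is not executed here; only the plain-Python check at the bottom runs.

```python
from functools import reduce
import operator


def any_of(conditions):
    """OR together a non-empty iterable of objects that support `|`.

    Hypothetical helper; reduce() raises TypeError on an empty iterable
    because no initial value is supplied.
    """
    return reduce(operator.or_, conditions)


def filter_any(df, conditions):
    """Keep rows matching at least one condition.

    Assumes `df` is a pyspark.sql.DataFrame and `conditions` is a list
    of pyspark.sql.Column boolean expressions.
    """
    return df.where(any_of(conditions))


# Plain-Python check of the folding logic, using booleans instead of Columns.
flags = [False, True, False]
print(any_of(flags))
# True
```

With the question's list, this would be called as filter_any(df.select(columns), spark_conditions).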
