Selecting the immediately smaller/larger value when comparing multiple columns against a single column

d = [
    {'id': 500, 'text1': 1000, 'text2': 2000, 'text3': 3000, 'text4': 5000},
    {'id': 1500, 'text1': 1000, 'text2': 2000, 'text3': 3000, 'text4': 5000},
    {'id': 2500, 'text1': 1000, 'text2': 2000, 'text3': 3000, 'text4': 5000},
    {'id': 3500, 'text1': 1000, 'text2': 2000, 'text3': 3000, 'text4': 5000},
    {'id': 4500, 'text1': 1000, 'text2': 2000, 'text3': 3000, 'text4': 5000},
    {'id': 5500, 'text1': 1000, 'text2': 2000, 'text3': 3000, 'text4': 5000},
]
data = spark.createDataFrame(d)

3 Answers

User

#1 · Edited 2024-06-01 22:27:32

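Here is another approach using the array_max and array_min functions together with when expressions:

lowerBound = max of the thresh_cols values that satisfy thresh_col < id
upperBound = min of the thresh_cols values that satisfy thresh_col > id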

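from pyspark.sql import functions as F

# thresh_cols is not defined in the original answer; assumed here to be
# every column whose name starts with 'text'
thresh_cols = [c for c in data.columns if c.startswith('text')]

# when() without otherwise() yields null where the condition fails,
# and array_max/array_min skip null elements
result = data.withColumn(
    'lowerBound',
    F.array_max(F.array(*[F.when(F.col(c) < F.col('id'), F.col(c)) for c in thresh_cols]))
).withColumn(
    'upperBound',
    F.array_min(F.array(*[F.when(F.col(c) > F.col('id'), F.col(c)) for c in thresh_cols]))
)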

result.show()

#+----+-----+-----+-----+-----+----------+----------+
#|  id|text1|text2|text3|text4|lowerBound|upperBound|
#+----+-----+-----+-----+-----+----------+----------+
#| 500| 1000| 2000| 3000| 5000|      null|      1000|
#|1500| 1000| 2000| 3000| 5000|      1000|      2000|
#|2500| 1000| 2000| 3000| 5000|      2000|      3000|
#|3500| 1000| 2000| 3000| 5000|      3000|      5000|
#|4500| 1000| 2000| 3000| 5000|      3000|      5000|
#|5500| 1000| 2000| 3000| 5000|      5000|      null|
#+----+-----+-----+-----+-----+----------+----------+
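
For checking expected values on small inputs, the rule behind lowerBound and upperBound can be written out in plain Python. This is only a sketch and not part of the original answer: the helper name bounds is hypothetical, and None stands in for Spark's null.

```python
from typing import Optional, Sequence, Tuple


def bounds(id_value: int, thresholds: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Return (lower, upper) for one row.

    lower is the largest threshold strictly below id_value,
    upper is the smallest threshold strictly above it.
    Hypothetical helper for illustration; None plays the role of null.
    """
    below = [t for t in thresholds if t < id_value]
    above = [t for t in thresholds if t > id_value]
    lower = max(below) if below else None
    upper = min(above) if above else None
    return lower, upper


thresholds = [1000, 2000, 3000, 5000]
for id_value in (500, 1500, 2500, 3500, 4500, 5500):
    print(id_value, bounds(id_value, thresholds))
```

Both comparisons are strict, so a threshold equal to id is returned as neither bound.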

User

#2 · Edited 2024-06-01 22:27:32

Here is an approach using Spark higher-order functions (Spark >= 2.4):

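from pyspark.sql import functions as F

df_cols = data.columns
thresh_list = [x for x in df_cols if x.startswith('text')]

out = (
    # collect the threshold columns into one ascending array
    data.select("*", F.sort_array(F.array(*thresh_list)).alias("Arr"))
    # last of the sorted values that are below id
    .withColumn("FirstVal", F.expr("element_at(filter(Arr, x -> x < id), -1)"))
    # first of the sorted values that are above id
    .withColumn("LastVal", F.expr("filter(Arr, x -> x > id)[0]"))
    .drop("Arr")
)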

out.show(truncate=False)

+----+-----+-----+-----+-----+--------+-------+
|id  |text1|text2|text3|text4|FirstVal|LastVal|
+----+-----+-----+-----+-----+--------+-------+
|500 |1000 |2000 |3000 |5000 |null    |1000   |
|1500|1000 |2000 |3000 |5000 |1000    |2000   |
|2500|1000 |2000 |3000 |5000 |2000    |3000   |
|3500|1000 |2000 |3000 |5000 |3000    |5000   |
|4500|1000 |2000 |3000 |5000 |3000    |5000   |
|5500|1000 |2000 |3000 |5000 |5000    |null   |
+----+-----+-----+-----+-----+--------+-------+
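
The sort-then-pick idea used here can also be sketched in plain Python with the standard bisect module, which finds the two neighbours in a sorted list without scanning it twice. This is an illustration only, not the original answer's code; the helper name nearest is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence, Tuple


def nearest(id_value: int, values: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Return (first_val, last_val) around id_value.

    Hypothetical helper mirroring the FirstVal/LastVal columns:
    first_val is the largest value < id_value, last_val the smallest > id_value.
    """
    arr = sorted(values)              # same role as sort_array
    lo = bisect_left(arr, id_value)   # count of elements strictly below id_value
    hi = bisect_right(arr, id_value)  # index of the first element strictly above
    first_val = arr[lo - 1] if lo > 0 else None
    last_val = arr[hi] if hi < len(arr) else None
    return first_val, last_val


print(nearest(2500, [3000, 1000, 5000, 2000]))  # (2000, 3000)
```

Using bisect_left for the lower side and bisect_right for the upper side keeps both comparisons strict, matching x < id and x > id in the filters.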

User

#3 · Edited 2024-06-01 22:27:32

You can use least and greatest to get the relevant columns:

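import pyspark.sql.functions as F

# when() yields null where the condition fails; greatest/least skip nulls.
# Looping over data.columns also includes 'id', which is harmless here
# because id < id and id > id are never true.
df = data.withColumn(
    'col1',
    F.greatest(*[
        F.when(F.col(c) < F.col('id'), F.col(c))
        for c in data.columns
    ])
).withColumn(
    'col2',
    F.least(*[
        F.when(F.col(c) > F.col('id'), F.col(c))
        for c in data.columns
    ])
)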

df.show()
+----+-----+-----+-----+-----+----+----+
|  id|text1|text2|text3|text4|col1|col2|
+----+-----+-----+-----+-----+----+----+
| 500| 1000| 2000| 3000| 5000|null|1000|
|1500| 1000| 2000| 3000| 5000|1000|2000|
|2500| 1000| 2000| 3000| 5000|2000|3000|
|3500| 1000| 2000| 3000| 5000|3000|5000|
|4500| 1000| 2000| 3000| 5000|3000|5000|
|5500| 1000| 2000| 3000| 5000|5000|null|
+----+-----+-----+-----+-----+----+----+

You can then work with col1 and col2 as needed.
