New column selecting values from a list of integers

|ID    |YearBLT|MinYear|MaxYear|ADP_Range               |
|------|-------|-------|-------|------------------------|
|164876|2010   |2004   |2009   |[2004,2009]             |
|164877|2008   |2000   |2011   |[2000, 2002, 2011]      |
|164878|2000   |2003   |2011   |[2003, 2011]            |
|164879|2013   |1999   |2015   |[2003, 2007, 2015, 1999]|

1 answer

User

#1 · Posted 2024-10-01 17:21:46

First, let's find the maximum of the range:

from pyspark.sql.functions import array_max, col, expr, when

max_adp_range = array_max("ADP_Range")

And the closest value:

[code missing in source: definition of closest_adp_range]
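The definition of closest_adp_range did not survive in this copy. A minimal sketch of what it might look like, assuming it mirrors the UDF fallback further down (the largest element of ADP_Range strictly below YearBLT) and uses the filter higher-order function through a SQL expression. The names CLOSEST_BELOW_SQL, closest_below and make_closest_adp_range are hypothetical and not part of the original answer.

```python
# Assumed SQL form of the missing expression: keep the years strictly below
# YearBLT, then take the largest. array_max yields NULL for an empty array.
CLOSEST_BELOW_SQL = "array_max(filter(ADP_Range, y -> y < YearBLT))"


def closest_below(years, year_blt):
    """Plain-Python mirror of CLOSEST_BELOW_SQL, for checking expected values."""
    return max((y for y in years if y < year_blt), default=None)


def make_closest_adp_range():
    """Build the Spark Column (needs pyspark 2.4+ and an active SparkSession)."""
    # Imported lazily so closest_below() can be exercised without Spark.
    from pyspark.sql.functions import expr
    return expr(CLOSEST_BELOW_SQL)
```

With a running session, closest_adp_range = make_closest_adp_range() would then slot into the when(...) chain that follows.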

Combine the two into a single expression:

adp_year = when(
    # If the YearBLT is greater than the MaxYear, ADP_Year == Max(ADP_Range)
    col("YearBLT") > col("MaxYear"), max_adp_range
).when(
    # If the YearBLT is in between, it chooses 
    # the closest date below the YearBLT in the ADP_Range
    col("YearBLT").between(col("MinYear"), col("MaxYear")), closest_adp_range
).otherwise(
    # If the YearBLT is less than the MinYear, ADP_Year is null ("NA").
    # Note: otherwise(None) is the default; included just for clarity.
    None
)

Finally, select:

df = spark.createDataFrame([                                    
    (164876, 2010, 2004, 2009, [2004,2009]),
    (164877, 2008, 2000, 2011, [2000, 2002, 2011]),   
    (164878, 2000, 2003, 2011, [2003, 2011]),         
    (164879, 2013, 1999, 2015, [2003, 2007, 2015, 1999])
], ("id", "YearBLT", "MinYear", "MaxYear", "ADP_Range"))

df.withColumn("ADP_YEAR", adp_year).show()

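This gives the expected result:

+------+-------+-------+-------+--------------------+--------+
|    id|YearBLT|MinYear|MaxYear|           ADP_Range|ADP_YEAR|
+------+-------+-------+-------+--------------------+--------+
|164876|   2010|   2004|   2009|        [2004, 2009]|    2009|
|164877|   2008|   2000|   2011|  [2000, 2002, 2011]|    2002|
|164878|   2000|   2003|   2011|        [2003, 2011]|    null|
|164879|   2013|   1999|   2015|[2003, 2007, 2015...|    2007|
+------+-------+-------+-------+--------------------+--------+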

Both array_max and the filter higher-order function require Spark 2.4 or later. In 2.3 or earlier, you can redefine the expressions above as

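from pyspark.sql.functions import udf

# Largest year in the array.
max_adp_range = udf(max, "bigint")("ADP_Range")
# Largest year strictly below YearBLT. default=None avoids a ValueError when
# no year qualifies; Spark does not guarantee that when() shields a UDF
# from rows that fail the condition.
closest_adp_range = udf(
    lambda xs, y: max((x for x in xs if x < y), default=None), "bigint"
)("ADP_Range", "YearBLT")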

but you should expect a significant performance penalty (a single udf should be faster, but still slower than native expressions).
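As a rough illustration of the single-udf variant mentioned above, the whole rule can be folded into one plain function and wrapped once. This is only a sketch: pick_adp_year and make_adp_year_udf are hypothetical names, and the None handling is an assumption about how missing values should behave.

```python
def pick_adp_year(year_blt, min_year, max_year, adp_range):
    """Apply the full ADP_YEAR rule to a single row in plain Python."""
    # Assumption: any missing input or an empty range gives no result.
    if None in (year_blt, min_year, max_year) or not adp_range:
        return None
    if year_blt > max_year:
        # Built after the range ends: take the latest year in the range.
        return max(adp_range)
    if min_year <= year_blt <= max_year:
        # Built inside the range: take the closest year strictly below YearBLT.
        return max((y for y in adp_range if y < year_blt), default=None)
    # Built before the range starts: no value.
    return None


def make_adp_year_udf():
    """Wrap pick_adp_year as a Spark UDF (needs pyspark and an active session)."""
    # Imported lazily so the rule itself can be tested without Spark.
    from pyspark.sql.functions import udf
    return udf(pick_adp_year, "bigint")
```

It would be applied as df.withColumn("ADP_YEAR", make_adp_year_udf()("YearBLT", "MinYear", "MaxYear", "ADP_Range")), so each row crosses into Python once instead of twice.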
