How do I return a list of doubles from a PySpark UDF?

Posted 2024-07-04 08:03:04


from pyspark.sql import functions as func

I have a PySpark DataFrame called df. It has the following schema:

id: string
item: string
data: double

I apply the following operation to it:

[code snippet missing from the source]

I have also defined the user-defined function iqrOnList:

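from pyspark.sql.functions import udf

@udf
def iqrOnList(accumulatorsList: list):
  import numpy as np

  # Quartiles and interquartile range of the collected values
  Q1 = np.percentile(accumulatorsList, 25)
  Q3 = np.percentile(accumulatorsList, 75)
  IQR = Q3 - Q1

  # Fences beyond which a value counts as an outlier
  lowerFence = Q1 - (1.5 * IQR)
  upperFence = Q3 + (1.5 * IQR)

  # Keep in-range values; replace outliers with None
  return [elem if (elem >= lowerFence and elem <= upperFence) else None for elem in accumulatorsList]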

I use this UDF as follows:

grouped_df = grouped_df.withColumn("SecondList", iqrOnList(grouped_df.dataList))

These operations produce the DataFrame grouped_df, with the following schema:

id: string
item: string
dataList: array
SecondList: string

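The problem:

SecondList has exactly the values I expect (for example [1, 2, 3, null, 3, null, 2]), but the wrong return type: string instead of array, even though it keeps its list form.

I need it stored as an array, exactly like dataList.

Questions:

1) How can I keep the correct type?

2) This UDF is expensive performance-wise. I read here that Pandas UDFs perform much better than ordinary UDFs. What would the equivalent of this approach be as a Pandas UDF?

Bonus question (lower priority): func.collect_list(df.data) does not collect null values, while {} does. I would like to collect them too; how can I do that without replacing all null values with another default value?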


1 answer

Answer 1 · posted 2024-07-04 08:03:04

You can keep your current syntax; you only need to supply the return type in the decorator declaration:

import pyspark.sql.types as Types
@udf(returnType=Types.ArrayType(Types.DoubleType()))
