Using a class method inside a PySpark UDF

1 answer

User
#1 · Posted on 2024-10-02 18:25:49

As with database connections, you can limit how many instances of the class get created by using mapPartitions:
In [1]: from datetime import date
   ...: from astral import Astral
   ...:
   ...: df = spark.createDataFrame(
   ...:     ((date(2019, 10, 4), 0),
   ...:      (date(2019, 10, 4), 19)),
   ...:     schema=("date", "longitude"))
   ...:
   ...:
   ...: def solar_noon(rows):
   ...:     a = Astral()  # initialize the class once per partition
   ...:     return ((a.solar_noon_utc(date=r.date, longitude=r.longitude), *r)
   ...:             for r in rows)  # reuses the same Astral instance for all rows in this partition
   ...:
   ...:
   ...: (df.rdd
   ...:     .mapPartitions(solar_noon)
   ...:     .toDF(schema=("solar_noon_utc", *df.columns))
   ...:     .show()
   ...: )
   ...:
   ...:
+-------------------+----------+---------+
|     solar_noon_utc|      date|longitude|
+-------------------+----------+---------+
|2019-10-04 13:48:58|2019-10-04|        0|
|2019-10-04 12:32:58|2019-10-04|       19|
+-------------------+----------+---------+
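This is quite efficient because the function (solar_noon) is shipped to each worker, and the class is initialized only once per partition, which can hold many rows.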
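To see why the per-partition pattern saves work, here is a minimal sketch that runs without Spark. It is only a simulation: `ExpensiveHelper` is a hypothetical stand-in for a costly-to-build class such as Astral, and plain Python lists stand in for partitions. It counts how many times the class is constructed when the instance is created once per partition versus once per row.

```python
class ExpensiveHelper:
    """Hypothetical stand-in for a class that is costly to construct."""
    instances = 0  # counts how many times the class has been instantiated

    def __init__(self):
        ExpensiveHelper.instances += 1

    def compute(self, value):
        return value * 2


def per_partition(rows):
    # Same shape as a function passed to mapPartitions:
    # takes an iterator of rows, returns an iterator of results.
    helper = ExpensiveHelper()  # built once for the whole partition
    return (helper.compute(r) for r in rows)


def per_row(rows):
    # Anti-pattern for comparison: a new instance for every row.
    return (ExpensiveHelper().compute(r) for r in rows)


# Three simulated partitions holding nine rows in total (made-up data).
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

ExpensiveHelper.instances = 0
per_partition_result = [list(per_partition(iter(p))) for p in partitions]
per_partition_instances = ExpensiveHelper.instances

ExpensiveHelper.instances = 0
per_row_result = [list(per_row(iter(p))) for p in partitions]
per_row_instances = ExpensiveHelper.instances

print(per_partition_instances, per_row_instances)
```

Both variants produce the same results; only the number of constructor calls differs (one per partition versus one per row). In real Spark the number of partitions, and therefore the number of instances, depends on how the DataFrame is partitioned.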
