I have a PySpark DataFrame trips on which I am performing an aggregation. For each PULocationID, I first compute the average of total_amount, then the number of trips, and finally the number of trips whose DOLocationID appears in the DOLocationID column of mtrips, which is another PySpark DataFrame.
I have included the schemas of trips and mtrips below.
My current code is as follows, but it is incomplete:
import pyspark.sql.functions as F
cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
(
trips
.groupBy('PULocationID', 'DOLocationID')
.agg(
F.mean('total_amount').alias('avg_total_amt'),
F.count('*').alias('trip_count'),
cnt_cond(mtrips.DOLocationID.contains(trips.DOLocationID)).alias('trips_to_pop')
)
.show(200)
)
trips.printSchema()
# root
# |-- VendorID: integer (nullable = true)
# |-- tpep_pickup_datetime: timestamp (nullable = true)
# |-- tpep_dropoff_datetime: timestamp (nullable = true)
# |-- passenger_count: integer (nullable = true)
# |-- trip_distance: double (nullable = true)
# |-- RatecodeID: integer (nullable = true)
# |-- store_and_fwd_flag: string (nullable = true)
# |-- PULocationID: integer (nullable = true)
# |-- DOLocationID: integer (nullable = true)
# |-- payment_type: integer (nullable = true)
# |-- fare_amount: double (nullable = true)
# |-- extra: double (nullable = true)
# |-- mta_tax: double (nullable = true)
# |-- tip_amount: double (nullable = true)
# |-- tolls_amount: double (nullable = true)
# |-- improvement_surcharge: double (nullable = true)
# |-- total_amount: double (nullable = true)
# |-- congestion_surcharge: double (nullable = true)
mtrips.printSchema()
# root
# |-- DOLocationID: integer (nullable = true)
# |-- pcount: long (nullable = true)
The following lines of code solve the problem: