pySpark在分组d中查找公共值

+--------+---------+---------+ |PlayerID|PitcherID|ThrowHand| +--------+---------+---------+ |10000598| 10000104| R| |10000908| 10000104| R| |10000489| 10000104| R| |10000734| 10000104| R| |10006568| 10000104| R| |10000125| 10000895| L| |10000133| 10000895| L| |10006354| 10000895| L| |10000127| 10000895| L| |10000121| 10000895| L|

+--------+----+---+ |PlayerID| L | R| +--------+----+---+ |10000591| 11| 43| |10000172| 22|101| |10000989| 05| 19| |10000454| 05| 17| |10000723| 11| 33| |10001989| 11| 38| |10005243| 20| 60| |10003366| 11| 26| |10006058| 02| 09| +--------+----+---+

1条回答

网友

1楼 · 发布于 2024-05-20 16:46:21

您应该首先为每个throuwhandas获取pitchrids的集合

import pyspark.sql.functions as F
#collect set of pitchers in addition to count of ThrowHand
df = df.groupBy('PlayerID').pivot('ThrowHand').agg(F.count('ThrowHand').alias('count'), F.collect_set('PitcherID').alias('PitcherID')).drop('null')

它应该给你dataframe作为

^{pr2}$

然后写一个udf函数来得到pitcherIDs

#columns with pitcherid and count
pitcherColumns = [x for x in df.columns if 'PitcherID' in x]
countColumns = [x for x in df.columns if 'count' in x]

#udf function to find the common pitcher between the collected pitchers
@F.udf(T.ArrayType(T.StringType()))
def commonFindingUdf(*pitcherCols):
    common = pitcherCols[0]
    for pitcher in pitcherCols[1:]:
        common = set(common).intersection(pitcher)
    return [x for x in common]

#calling the udf function and selecting the required columns
df.select(F.col('PlayerID'), commonFindingUdf(*[col(x) for x in pitcherColumns]).alias('common_PitcherID'), *countColumns)

这应该给你最后的dataframe作为

root
 |  PlayerID: string (nullable = true)
 |  common_PitcherID: array (nullable = true)
 |    |  element: string (containsNull = true)
 |  L_count: long (nullable = false)
 |  R_count: long (nullable = false)

我希望答案是有帮助的

相关问题更多 >

编程相关推荐

热门问题

热门文章