pySpark在分组d中查找公共值

2024-05-20 16:46:21 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图在pySpark中的一个数据帧上应用groupBy和pivot来创建组之间的公共值。 例如,数据看起来像:

+--------+---------+---------+
|PlayerID|PitcherID|ThrowHand|
+--------+---------+---------+
|10000598| 10000104|        R|
|10000908| 10000104|        R|
|10000489| 10000104|        R|
|10000734| 10000104|        R|
|10006568| 10000104|        R|
|10000125| 10000895|        L|
|10000133| 10000895|        L|
|10006354| 10000895|        L|
|10000127| 10000895|        L|
|10000121| 10000895|        L|

申请后:

^{pr2}$

我得到了一些东西比如:-

+--------+----+---+
|PlayerID| L  |  R|
+--------+----+---+
|10000591|  11| 43|
|10000172|  22|101|
|10000989|  05| 19|
|10000454|  05| 17|
|10000723|  11| 33|
|10001989|  11| 38|
|10005243|  20| 60|
|10003366|  11| 26|
|10006058|  02| 09|
+--------+----+---+

有没有什么方法可以让我在上面的L和R的计数中得到“pitchrid”的共同值。在

我的意思是对于PlayerID=10000591,我有11个pitchrid,ThrowHand是L,43个pitchrid,ThrowHand是43。有可能有些投手在这11个投手和43个投手中很常见。在

有什么办法我能弄到这些普通的猪笼草吗?在


Tags: 数据方法pyspark计数pivotgroupby办法投手
1条回答
网友
1楼 · 发布于 2024-05-20 16:46:21

您应该首先为每个throuwhandas获取pitchrids的集合

import pyspark.sql.functions as F
#collect set of pitchers in addition to count of ThrowHand
df = df.groupBy('PlayerID').pivot('ThrowHand').agg(F.count('ThrowHand').alias('count'), F.collect_set('PitcherID').alias('PitcherID')).drop('null')

它应该给你dataframe作为

^{pr2}$

然后写一个udf函数来得到pitcherIDs

#columns with pitcherid and count
pitcherColumns = [x for x in df.columns if 'PitcherID' in x]
countColumns = [x for x in df.columns if 'count' in x]

#udf function to find the common pitcher between the collected pitchers
@F.udf(T.ArrayType(T.StringType()))
def commonFindingUdf(*pitcherCols):
    common = pitcherCols[0]
    for pitcher in pitcherCols[1:]:
        common = set(common).intersection(pitcher)
    return [x for x in common]

#calling the udf function and selecting the required columns
df.select(F.col('PlayerID'), commonFindingUdf(*[col(x) for x in pitcherColumns]).alias('common_PitcherID'), *countColumns)

这应该给你最后的dataframe作为

root
 |  PlayerID: string (nullable = true)
 |  common_PitcherID: array (nullable = true)
 |    |  element: string (containsNull = true)
 |  L_count: long (nullable = false)
 |  R_count: long (nullable = false)

我希望答案是有帮助的

相关问题 更多 >