PySpark in a Jupyter notebook: 'Column' object is not callable

Posted 2024-05-20 14:09:50


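I am analyzing data on Olympic performances and want an overview of which athletes won the most medals. First I create extra columns, because in the original dataset the medal won is represented as a string ("Gold", "Silver", etc.) or NA.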

from pyspark.sql.functions import col, count, when

totalDF = olympicDF.count()
medalswonDF = olympicDF\
    .where(col("Medal") != "NA")\
    .withColumn("Gold", when(col("Medal") == "Gold", "1"))\
    .withColumn("Silver", when(col("Medal") == "Silver", "1"))\
    .withColumn("Bronze", when(col("Medal") == "Bronze", "1"))\
    .withColumn("Total", when(col("Medal") != "NA", "1"))  # the "1" is just a placeholder for now

Next, I want to show a table of the 25 most successful athletes (in terms of medals won).

medalswonDF.cache() # optimization to make the processing faster

medalswonDF.where(col("Medal")!="NA")\
    .select("Name", "Gold", "Silver", "Bronze")\
    .groupBy("Name")\
    .agg(count("Gold")),\
        (count("Silver")),\
        (count("Bronze"))\
    .orderBy("Gold").desc()\
    .select("Name", "Gold", "Silver", "Bronze").show(25,True)

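However, I keep getting the error "TypeError: 'Column' object is not callable". I understand that this happens, among other reasons, when you try to apply a function that cannot be applied to a column, but as far as I can tell that should not be the cause here.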

Schema for reference:

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)
 |-- Gold: string (nullable = true)
 |-- Silver: string (nullable = true)
 |-- Bronze: string (nullable = true)
 |-- Total: string (nullable = true)

What am I doing wrong?


1 answer
User
#1 · Posted 2024-05-20 14:09:50

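You are closing agg with an extra parenthesis before it should be closed. The parenthesis after count("Gold") ends the agg call early, so the remaining count expressions form a plain Python tuple and .orderBy ends up being called on the last count column instead of on the DataFrame, which raises the error.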
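To make the mechanism concrete, here is a minimal pure-Python sketch that runs without Spark. `FakeColumn`, `count` and `agg` are hypothetical stand-ins written for this illustration, not the real PySpark objects. The only behaviour borrowed from PySpark is that attribute access on a `Column` (for an unknown name such as `orderBy`) returns another `Column`, which is not callable.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (not the real class)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only reached for unknown attributes: mimic Column's field access,
        # which hands back another column instead of raising AttributeError.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def count(name):
    # Stand-in for pyspark.sql.functions.count
    return FakeColumn(f"count({name})")


def agg(*exprs):
    # Stand-in for GroupedData.agg
    return f"aggregated {len(exprs)} expression(s)"


def run_broken():
    # agg(...) is closed right after the first count, so the three pieces
    # form a tuple and .orderBy is looked up on the last FakeColumn.
    return agg(count("Gold")), \
        (count("Silver")), \
        (count("Bronze")).orderBy("Gold")


try:
    run_broken()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # prints: 'FakeColumn' object is not callable
```

Keeping all aggregate expressions inside a single pair of parentheses after `agg` avoids this, because `.orderBy` is then resolved on the result of `agg` rather than on a column.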

Change the code as follows:

# desc() is a Column method, not a DataFrame method, so sort on col("Gold_count").desc()
medalswonDF.where(col("Medal")!="NA")\
                 .select("Name", "Gold", "Silver", "Bronze")\
                 .groupBy("Name")\
                 .agg(count("Gold").alias("Gold_count"),
                      count("Silver").alias("Silver_count"),
                      count("Bronze").alias("Bronze_count")) \
                 .orderBy(col("Gold_count").desc())\
                 .select("Name", "Gold_count", "Silver_count", "Bronze_count").show(25,True)
