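I am analyzing data on Olympic performance and want an overview of which athlete has won the most medals. First, I create additional columns, because in the original dataset the medals won are represented by a string ("Gold", "Silver", etc.) or NA: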
totalDF = olympicDF.count()
medalswonDF = olympicDF\
.where(col("Medal")!="NA")\
.withColumn("Gold", when(col("Medal")== "Gold",("1")))\
.withColumn("Silver", when(col("Medal")== "Silver",("1")))\
.withColumn("Bronze", when(col("Medal")== "Bronze",("1")))\
.withColumn("Total", when(col("Medal")!= "NA", ("1")))  # the "1" is just a placeholder for now
Next, I want to display a table of the 25 most successful athletes (in terms of medals won):
medalswonDF.cache() # optimization to make the processing faster
medalswonDF.where(col("Medal")!="NA")\
.select("Name", "Gold", "Silver", "Bronze")\
.groupBy("Name")\
.agg(count("Gold")),\
(count("Silver")),\
(count("Bronze"))\
.orderBy("Gold").desc()\
.select("Name", "Gold", "Silver", "Bronze").show(25,True)
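However, I keep getting the error "TypeError: 'Column' object is not callable". I understand this can happen, among other reasons, when you try to apply a function that cannot be applied to a column, but as far as I can tell that should not be the cause here.
Schema for reference: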
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Age: integer (nullable = true)
|-- Height: integer (nullable = true)
|-- Weight: integer (nullable = true)
|-- Team: string (nullable = true)
|-- NOC: string (nullable = true)
|-- Games: string (nullable = true)
|-- Year: string (nullable = true)
|-- Season: string (nullable = true)
|-- City: string (nullable = true)
|-- Sport: string (nullable = true)
|-- Event: string (nullable = true)
|-- Medal: string (nullable = true)
|-- Gold: string (nullable = true)
|-- Silver: string (nullable = true)
|-- Bronze: string (nullable = true)
|-- Total: string (nullable = true)
What am I doing wrong?
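You have an extra closing parenthesis that ends the agg call before it should be closed. The line .agg(count("Gold")), closes agg right after the first count, so the Silver and Bronze counts are never passed to it.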
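To see why the early parenthesis could surface as this particular TypeError, here is a small pure-Python sketch that needs no Spark installation. `FakeColumn` and `FakeFrame` are hypothetical stand-ins, not PySpark classes; the sketch assumes that a PySpark `Column` treats an unknown attribute such as `.orderBy` as a nested-field lookup and returns another `Column`, which is then called.

```python
class FakeColumn:
    """Hypothetical stand-in for a Column: unknown attributes yield another column."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Assumption: mimics Column treating attribute access as a field lookup.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}[{item}]")


class FakeFrame:
    """Hypothetical stand-in for a DataFrame with a chainable agg()."""

    def agg(self, *exprs):
        return self


def count(name):
    return FakeColumn(f"count({name})")


df = FakeFrame()
message = None
try:
    # Same shape as the question's code: agg(...) is closed after the first
    # count, so Python builds a tuple and .orderBy is looked up on a column.
    result = df.agg(count("Gold")), \
        (count("Silver")), \
        (count("Bronze")) \
        .orderBy("Gold")
except TypeError as exc:
    message = str(exc)

print(message)
```

The third tuple element is `count("Bronze").orderBy("Gold")`: the attribute lookup succeeds and returns a column-like object, and calling that object is what raises the error.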
Change the code as follows: