My Spark dataframe looks like this:
+---------+---------------------------+
|country |sports |
+---------+---------------------------+
|India |[Cricket, Hockey, Football]|
|Sri Lanka|[Cricket, Football] |
+---------+---------------------------+
Each sport in the sports column is represented by a code:
sport_to_code_map = {
    'Cricket': 0x0001,
    'Hockey': 0x0002,
    'Football': 0x0004
}
Now I want to add a new column named sportsInt, which is the bitwise OR of the codes associated with each sport string in the map above (e.g. for India, 0x0001 | 0x0002 | 0x0004 = 7), giving:
+---------+---------------------------+---------+
|country |sports |sportsInt|
+---------+---------------------------+---------+
|India |[Cricket, Hockey, Football]|7 |
|Sri Lanka|[Cricket, Football] |5 |
+---------+---------------------------+---------+
I know one way to do this is with a UDF, something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def get_sport_to_code(sport_name):
    sport_to_code_map = {
        'Cricket': 0x0001,
        'Hockey': 0x0002,
        'Football': 0x0004
    }
    if sport_name not in sport_to_code_map:
        raise Exception(f'Unknown Sport: {sport_name}')
    return sport_to_code_map[sport_name]

def sport_to_code(sports):
    if not sports:
        return None
    code = 0x0000
    for sport in sports:
        code = code | get_sport_to_code(sport)
    return code

# The UDF returns an integer code, so it is declared with IntegerType
# (note that StringType lives in pyspark.sql.types, not pyspark.sql.functions).
sport_to_code_udf = F.udf(sport_to_code, IntegerType())
df = df.withColumn('sportsInt', sport_to_code_udf('sports'))
But is there a way to do this with Spark's built-in functions instead of a UDF?
From Spark 2.4+ we can use the aggregate higher-order function together with the bitwise OR operator for this case. Example:
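A minimal sketch of that approach (column names come from the question; the literal codes 1, 2 and 4 are the values from sport_to_code_map):

import pyspark.sql.functions as F

# transform() maps each sport name in the array to its integer code,
# then aggregate() folds the codes together with bitwise OR. The ELSE 0
# branch makes unknown sports contribute nothing to the OR instead of
# turning the whole result NULL.
df.withColumn(
    'sportsInt',
    F.expr("""
        aggregate(
            transform(sports, x ->
                CASE x
                    WHEN 'Cricket'  THEN 1
                    WHEN 'Hockey'   THEN 2
                    WHEN 'Football' THEN 4
                    ELSE 0
                END),
            0,
            (acc, x) -> acc | x)
    """)
).show(truncate=False)

This yields sportsInt = 7 (1 | 2 | 4) for India and 5 (1 | 4) for Sri Lanka, matching the expected output above.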
If you want to avoid doing the lookup in the sport_to_code_map dict per row, use .replace:
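One way to read that, sketched under the assumption that the dict's str() form is rewritten into a SQL CASE expression with chained .replace calls (the helper name case_expr is mine):

import pyspark.sql.functions as F

sport_to_code_map = {'Cricket': 0x0001, 'Hockey': 0x0002, 'Football': 0x0004}

# str(sport_to_code_map) is "{'Cricket': 1, 'Hockey': 2, 'Football': 4}";
# the chained .replace calls turn it into:
# "CASE x WHEN 'Cricket' THEN 1 WHEN 'Hockey' THEN 2 WHEN 'Football' THEN 4 END"
case_expr = (str(sport_to_code_map)
             .replace('{', 'CASE x WHEN ')
             .replace(',', ' WHEN')
             .replace(':', ' THEN')
             .replace('}', ' END'))

# Note: without an ELSE branch, a sport missing from the dict makes the
# CASE return NULL, and NULL propagates through the bitwise OR.
df.withColumn(
    'sportsInt',
    F.expr(f'aggregate(transform(sports, x -> {case_expr}), 0, (acc, x) -> acc | x)')
).show(truncate=False)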