这是一个数据帧:
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------+
|id |actions |clicks|spend |
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------+
|d353|[{"action_type":"key1","value":"55"}, {"action_type":"key2","value":"1"}, {"action_type":"key3","value":"56"}, {"action_type":"key4","value":"56"}, {"action_type":"key5","value":"16"}, {"action_type":"key8","value":"12"}, {"action_type":"key12","value":"8"}, {"action_type":"key10","value":"12"}, {"action_type":"key19","value":"12"}] |8 |835 |
|d353|[{"action_type":"key1","value":"50"}, {"action_type":"key2","value":"1"}, {"action_type":"key4","value":"51"}, {"action_type":"key3","value":"51"}, {"action_type":"key5","value":"2"}] |7 |582 |
|d353|[{"action_type":"key1","value":"38"}, {"action_type":"key3","value":"38"}, {"action_type":"key4","value":"38"}, {"action_type":"key5","value":"6"}, {"action_type":"key8","value":"5"}, {"action_type":"key12","value":"5"}, {"action_type":"key10","value":"5"}, {"action_type":"key19","value":"5"}] |6 |205 |
|56df|[{"action_type":"key1","value":"58"}, {"action_type":"key2","value":"2"}, {"action_type":"key3","value":"60"}, {"action_type":"key4","value":"60"}, {"action_type":"key5","value":"23"}, {"action_type":"key8","value":"11"}, {"action_type":"key11","value":"10"}, {"action_type":"key10","value":"11"}, {"action_type":"key19","value":"11"}] |15 |169 |
|56df|[{"action_type":"key1","value":"3"}, {"action_type":"key4","value":"3"}, {"action_type":"key3","value":"3"}, {"action_type":"key5","value":"2"}, {"action_type":"key8","value":"25"}, {"action_type":"key11","value":"1"}, {"action_type":"key10","value":"25"}, {"action_type":"key19","value":"25"}] |1 |139 |
|1f6f|[{"action_type":"key1","value":"37"}, {"action_type":"key4","value":"37"}, {"action_type":"key3","value":"37"}, {"action_type":"key5","value":"3"}, {"action_type":"key8","value":"1"}, {"action_type":"key10","value":"1"}, {"action_type":"key19","value":"1"}] |9 |939 |
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------+
我希望所有3列按id聚合
问题是,我看不到如何在“actions”列(一个字符串)上执行该操作。而且数组中的元素数量也不相等。我希望按值对其进行聚合,但要将其作为字符串保留在末尾,因为我将把该数据帧写入数据库表
下面是一个模式:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
df = spark.read.parquet("actions.parquet")
df.printSchema()
root
|-- id: string (nullable = true)
|-- actions: string (nullable = true)
|-- clicks: integer (nullable = true)
|-- spend: integer (nullable = true)
预期结果:
|id |actions |clicks |spend |
+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-----------+
|d353|[{"action_type":"key1","value":"143"}, {"action_type":"key2","value":"40"}, {"action_type":"key3","value":"145"}, {"action_type":"key4","value":"145"}, {"action_type":"key5","value":"24"}, {"action_type":"key8","value":"23"}, {"action_type":"key12","value":"13"}, {"action_type":"key10","value":"17"}, {"action_type":"key19","value":"17"}] |21 |1622 |
|56df|[{"action_type":"key1","value":"61"}, {"action_type":"key2","value":"2"}, {"action_type":"key3","value":"63"}, {"action_type":"key4","value":"63"}, {"action_type":"key5","value":"25"}, {"action_type":"key8","value":"36"}, {"action_type":"key11","value":"12"}, {"action_type":"key10","value":"36"}, {"action_type":"key19","value":"36"}] |16 |308 |
|1f6f|[{"action_type":"key1","value":"37"}, {"action_type":"key3","value":"37"}, {"action_type":"key4","value":"37"}, {"action_type":"key5","value":"3"}, {"action_type":"key8","value":"1"}, {"action_type":"key10","value":"1"}, {"action_type":"key19","value":"1"}] |9 |939 |
+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-----------+
你能给我指一下正确的方向吗
编辑: 我猜我需要将该字符串字段转换为地图列表,进行计算,然后再次转换回字符串
最初我的计划是将actions列拆分为多个列,然后进行求和。 我尝试了以下类似的方法:
schema = ArrayType(
StructType(
[
StructField("action_type", StringType()),
StructField("value", StringType())
]
)
)
df = df.withColumn("actions", from_json(df.actions, schema))
所以模式现在是
root
|-- id: string (nullable = true)
|-- actions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- clicks: integer (nullable = true)
|-- spend: integer (nullable = true)
我得到了这样的东西
[{key1, 55}, {key2, 1}, {key3, 56}, {key4, 56} ...]
但后来我被卡住了。我不知道下一步该做什么,也不知道如何获得上面提到的“预期结果”
诀窍是使用groupby().agg()并为每个命名列提供一个字典函数,该函数在一个系列上进行所需的聚合。对于需要聚合的每个组,该函数将被调用一次。对于数值列,聚合函数仅为sum()
如果操作是一个列表(dict或其他类型),并且您希望在每个聚合组中将它们串在一起,那么itertools.chain.from_iterable()可以完成“操作”的大部分工作(请参见Flattening a shallow list in Python)。在这里,我们希望将链接的结果转换为列表,以便将chain()嵌入应用list()的lambda表达式中
如果操作是字符串,那么对于“操作”列,我们可能会尝试使用lambda表达式对每组字符串调用str.join()
但这并不完全正确,因为组111的“点击”看起来像[…][…],但应该看起来像[…]。为了正确地聚合这些列表,我们需要首先对每个单元格求值()以将其解释为列表,然后使用上面的chain()函数聚合组中的所有列表,最后使用repr()获得表示聚合列表的字符串
相关问题 更多 >
编程相关推荐