PySpark:如何对数组中实际为字符串列的dict值求和

2024-06-25 23:49:36 发布

您现在位置:Python中文网/ 问答频道 /正文

这是一个数据帧:

+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------+
|id  |actions                                                                                                                                                                                                                                                                                                                                                                                                                                                         |clicks|spend      |
+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------+
|d353|[{"action_type":"key1","value":"55"}, {"action_type":"key2","value":"1"}, {"action_type":"key3","value":"56"}, {"action_type":"key4","value":"56"}, {"action_type":"key5","value":"16"}, {"action_type":"key8","value":"12"}, {"action_type":"key12","value":"8"}, {"action_type":"key10","value":"12"}, {"action_type":"key19","value":"12"}]                                                                                                                  |8     |835        |
|d353|[{"action_type":"key1","value":"50"}, {"action_type":"key2","value":"1"}, {"action_type":"key4","value":"51"}, {"action_type":"key3","value":"51"}, {"action_type":"key5","value":"2"}]                                                                                                                                                                                                                                                                         |7     |582        |
|d353|[{"action_type":"key1","value":"38"}, {"action_type":"key3","value":"38"}, {"action_type":"key4","value":"38"}, {"action_type":"key5","value":"6"}, {"action_type":"key8","value":"5"}, {"action_type":"key12","value":"5"}, {"action_type":"key10","value":"5"}, {"action_type":"key19","value":"5"}]                                                                                                                                                          |6     |205        |
|56df|[{"action_type":"key1","value":"58"}, {"action_type":"key2","value":"2"}, {"action_type":"key3","value":"60"}, {"action_type":"key4","value":"60"}, {"action_type":"key5","value":"23"}, {"action_type":"key8","value":"11"}, {"action_type":"key11","value":"10"}, {"action_type":"key10","value":"11"}, {"action_type":"key19","value":"11"}]                                                                                                                 |15    |169        |
|56df|[{"action_type":"key1","value":"3"}, {"action_type":"key4","value":"3"}, {"action_type":"key3","value":"3"}, {"action_type":"key5","value":"2"}, {"action_type":"key8","value":"25"}, {"action_type":"key11","value":"1"}, {"action_type":"key10","value":"25"}, {"action_type":"key19","value":"25"}]                                                                                                                                                          |1     |139        |
|1f6f|[{"action_type":"key1","value":"37"}, {"action_type":"key4","value":"37"}, {"action_type":"key3","value":"37"}, {"action_type":"key5","value":"3"}, {"action_type":"key8","value":"1"}, {"action_type":"key10","value":"1"}, {"action_type":"key19","value":"1"}]                                                                                                                                                                                               |9     |939        |
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------+

我希望所有3列按id聚合

问题是,我看不到如何在“actions”列(一个字符串)上执行该操作。而且数组中的元素数量也不相等。我希望按值对其进行聚合,但要将其作为字符串保留在末尾,因为我将把该数据帧写入数据库表

下面是一个模式:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

df = spark.read.parquet("actions.parquet")
df.printSchema()
root
 |-- id: string (nullable = true)
 |-- actions: string (nullable = true)
 |-- clicks: integer (nullable = true)
 |-- spend: integer (nullable = true)

预期结果:

|id  |actions                                                                                                                                                                                                                                                                                                                                                 |clicks |spend      |
+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-----------+
|d353|[{"action_type":"key1","value":"143"}, {"action_type":"key2","value":"40"}, {"action_type":"key3","value":"145"}, {"action_type":"key4","value":"145"}, {"action_type":"key5","value":"24"}, {"action_type":"key8","value":"23"}, {"action_type":"key12","value":"13"}, {"action_type":"key10","value":"17"}, {"action_type":"key19","value":"17"}]     |21     |1622       |
|56df|[{"action_type":"key1","value":"61"}, {"action_type":"key2","value":"2"}, {"action_type":"key3","value":"63"}, {"action_type":"key4","value":"63"}, {"action_type":"key5","value":"25"}, {"action_type":"key8","value":"36"}, {"action_type":"key11","value":"12"}, {"action_type":"key10","value":"36"}, {"action_type":"key19","value":"36"}]         |16     |308        |
|1f6f|[{"action_type":"key1","value":"37"}, {"action_type":"key3","value":"37"}, {"action_type":"key4","value":"37"}, {"action_type":"key5","value":"3"}, {"action_type":"key8","value":"1"}, {"action_type":"key10","value":"1"}, {"action_type":"key19","value":"1"}]                                                                                       |9      |939        |
+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+-----------+

你能给我指一下正确的方向吗

编辑: 我猜我需要将该字符串字段转换为地图列表,进行计算,然后再次转换回字符串

最初我的计划是将actions列拆分为多个列,然后进行求和。 我尝试了以下类似的方法:

schema = ArrayType(
    StructType(
        [
            StructField("action_type", StringType()),
            StructField("value", StringType())
        ]
    )
)

df = df.withColumn("actions", from_json(df.actions, schema))

所以模式现在是

root
 |-- id: string (nullable = true)
 |-- actions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- action_type: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- clicks: integer (nullable = true)
 |-- spend: integer (nullable = true)

我得到了这样的东西

[{key1, 55}, {key2, 1}, {key3, 56}, {key4, 56} ...] 

但后来我被卡住了。我不知道下一步该做什么,也不知道如何获得上面提到的“预期结果”


Tags: actionsidtruevaluetypeactionkey2key1
1条回答
网友
1楼 · 发布于 2024-06-25 23:49:36

诀窍是使用groupby().agg()并为每个命名列提供一个字典函数,该函数在一个系列上进行所需的聚合。对于需要聚合的每个组,该函数将被调用一次。对于数值列,聚合函数仅为sum()

如果操作是一个列表(dict或其他类型),并且您希望在每个聚合组中将它们串在一起,那么itertools.chain.from_iterable()可以完成“操作”的大部分工作(请参见Flattening a shallow list in Python)。在这里,我们希望将链接的结果转换为列表,以便将chain()嵌入应用list()的lambda表达式中

import pandas as pd
from itertools import chain

# Some toy data.
df = pd.DataFrame(dict(actions=[['act1', 'act2'], ['act1', 'act3', 'act4'], ['act2', 'act4']],
                       clicks=[2,3,2],
                       spend=[800,650,743]),
                  index=[111,111,222])
df

#                 actions  clicks  spend
# 111        [act1, act2]       2    800
# 111  [act1, act3, act4]       3    650
# 222        [act2, act4]       2    743

# Group by index value, apply specified functions to groups within each named column.
#  See https://stackoverflow.com/questions/406121/flattening-a-shallow-list-in-python
df.groupby(level=0).agg(dict(actions=lambda x: list(chain.from_iterable(x)), clicks=sum, spend=sum))

#                             actions  clicks  spend
# 111  [act1, act2, act1, act3, act4]       5   1450
# 222                    [act2, act4]       2    743

如果操作是字符串,那么对于“操作”列,我们可能会尝试使用lambda表达式对每组字符串调用str.join()

import pandas as pd

# Some toy data.
actions = ['[{"c":"d"}, {"a":"b"}]',
           '[{"c":"d"}, {"a":"b"}], {"e":"f"}',
           '[{"c":"d"}, {"a":"b"}]']
df = pd.DataFrame(dict(actions=actions,
                       clicks=[2,3,2],
                       spend=[800,650,743]),
                  index=[111,111,222])
df

#                                actions  clicks  spend
# 111             [{"c":"d"}, {"a":"b"}]       2    800
# 111  [{"c":"d"}, {"a":"b"}], {"e":"f"}       3    650
# 222             [{"c":"d"}, {"a":"b"}]       2    743

df.groupby(level=0).agg(dict(actions=lambda x: ''.join(x), clicks=sum, spend=sum))

#                                                actions  clicks  spend
# 111  [{"c":"d"}, {"a":"b"}][{"c":"d"}, {"a":"b"}], ...       5   1450
# 222                             [{"c":"d"}, {"a":"b"}]       2    743

但这并不完全正确,因为组111的“点击”看起来像[…][…],但应该看起来像[…]。为了正确地聚合这些列表,我们需要首先对每个单元格求值()以将其解释为列表,然后使用上面的chain()函数聚合组中的所有列表,最后使用repr()获得表示聚合列表的字符串

def agg_actions(actions):
    assert isinstance(actions, pd.Series), f"Expected Series, got {type(actions)}"
    actions_as_list = actions.map(eval)  # Interpret string '[...]' as list [...].
    agg_as_list = list(chain.from_iterable(actions_as_list))  # Aggregate whole series into one list.
    return repr(agg_as_list)  # Return a string representation of the big list.

df.groupby(level=0).agg(dict(actions=agg_actions, clicks=sum, spend=sum))

#                                                actions  clicks  spend
# 111  [{'c': 'd'}, {'a': 'b'}, {'c': 'd'}, {'a': 'b'...       5   1450
# 222                           [{'c': 'd'}, {'a': 'b'}]       2    743

相关问题 更多 >