PySpark: How to sum the dict values in an array that is actually a string column

+----+-------+------+-----+
|id  |actions|clicks|spend|
+----+-------+------+-----+
|d353|[{"action_type":"key1","value":"55"}, {"action_type":"key2","value":"1"}, {"action_type":"key3","value":"56"}, {"action_type":"key4","value":"56"}, {"action_type":"key5","value":"16"}, {"action_type":"key8","value":"12"}, {"action_type":"key12","value":"8"}, {"action_type":"key10","value":"12"}, {"action_type":"key19","value":"12"}]|8 |835|
|d353|[{"action_type":"key1","value":"50"}, {"action_type":"key2","value":"1"}, {"action_type":"key4","value":"51"}, {"action_type":"key3","value":"51"}, {"action_type":"key5","value":"2"}]|7 |582|
|d353|[{"action_type":"key1","value":"38"}, {"action_type":"key3","value":"38"}, {"action_type":"key4","value":"38"}, {"action_type":"key5","value":"6"}, {"action_type":"key8","value":"5"}, {"action_type":"key12","value":"5"}, {"action_type":"key10","value":"5"}, {"action_type":"key19","value":"5"}]|6 |205|
|56df|[{"action_type":"key1","value":"58"}, {"action_type":"key2","value":"2"}, {"action_type":"key3","value":"60"}, {"action_type":"key4","value":"60"}, {"action_type":"key5","value":"23"}, {"action_type":"key8","value":"11"}, {"action_type":"key11","value":"10"}, {"action_type":"key10","value":"11"}, {"action_type":"key19","value":"11"}]|15|169|
|56df|[{"action_type":"key1","value":"3"}, {"action_type":"key4","value":"3"}, {"action_type":"key3","value":"3"}, {"action_type":"key5","value":"2"}, {"action_type":"key8","value":"25"}, {"action_type":"key11","value":"1"}, {"action_type":"key10","value":"25"}, {"action_type":"key19","value":"25"}]|1 |139|
|1f6f|[{"action_type":"key1","value":"37"}, {"action_type":"key4","value":"37"}, {"action_type":"key3","value":"37"}, {"action_type":"key5","value":"3"}, {"action_type":"key8","value":"1"}, {"action_type":"key10","value":"1"}, {"action_type":"key19","value":"1"}]|9 |939|
+----+-------+------+-----+

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
df = spark.read.parquet("actions.parquet")
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- actions: string (nullable = true)
 |-- clicks: integer (nullable = true)
 |-- spend: integer (nullable = true)

+----+-------+------+-----+
|id  |actions|clicks|spend|
+----+-------+------+-----+
|d353|[{"action_type":"key1","value":"143"}, {"action_type":"key2","value":"40"}, {"action_type":"key3","value":"145"}, {"action_type":"key4","value":"145"}, {"action_type":"key5","value":"24"}, {"action_type":"key8","value":"23"}, {"action_type":"key12","value":"13"}, {"action_type":"key10","value":"17"}, {"action_type":"key19","value":"17"}]|21|1622|
|56df|[{"action_type":"key1","value":"61"}, {"action_type":"key2","value":"2"}, {"action_type":"key3","value":"63"}, {"action_type":"key4","value":"63"}, {"action_type":"key5","value":"25"}, {"action_type":"key8","value":"36"}, {"action_type":"key11","value":"12"}, {"action_type":"key10","value":"36"}, {"action_type":"key19","value":"36"}]|16|308 |
|1f6f|[{"action_type":"key1","value":"37"}, {"action_type":"key3","value":"37"}, {"action_type":"key4","value":"37"}, {"action_type":"key5","value":"3"}, {"action_type":"key8","value":"1"}, {"action_type":"key10","value":"1"}, {"action_type":"key19","value":"1"}]|9 |939 |
+----+-------+------+-----+

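from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Each element of the JSON array is a struct with two string fields.
schema = ArrayType(
    StructType([
        StructField("action_type", StringType()),
        StructField("value", StringType()),
    ])
)

# Parse the JSON string column into an array of structs.
df = df.withColumn("actions", from_json(df.actions, schema))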

1 Answer

User

#1 · Posted 2024-06-25 23:49:36

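The trick is to use pandas groupby().agg() and pass it a dict that maps each named column to a function performing the desired aggregation on a Series. The function is called once for every group that needs aggregating. For the numeric columns, the aggregation function is simply sum().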
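To make the "one function per column, called once per group" idea concrete, here is a minimal stdlib-only sketch of the same bookkeeping. It is not how pandas implements agg(); the names `rows`, `groups`, `aggregators` and `totals` are chosen for illustration, and the numbers are the id, clicks and spend values from the question's sample table.

```python
from collections import defaultdict

# Rows copied from the question's sample table: (id, clicks, spend).
rows = [
    ("d353", 8, 835),
    ("d353", 7, 582),
    ("d353", 6, 205),
    ("56df", 15, 169),
    ("56df", 1, 139),
    ("1f6f", 9, 939),
]

# Collect the values of each column per group key.
groups = defaultdict(lambda: {"clicks": [], "spend": []})
for row_id, clicks, spend in rows:
    groups[row_id]["clicks"].append(clicks)
    groups[row_id]["spend"].append(spend)

# Apply one aggregation function per column, once per group.
aggregators = {"clicks": sum, "spend": sum}
totals = {
    row_id: {col: func(columns[col]) for col, func in aggregators.items()}
    for row_id, columns in groups.items()
}

print(totals)
```

The resulting totals for clicks and spend agree with the aggregated table shown in the question.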

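If actions is a list (of dicts or of any other type) and you want to concatenate the lists within each aggregated group, then itertools.chain.from_iterable() does most of the work for the "actions" column (see "Flattening a shallow list in Python"). Here we want the chained result as a list, so we wrap chain() in a lambda expression that applies list().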

import pandas as pd
from itertools import chain

# Some toy data.
df = pd.DataFrame(dict(actions=[['act1', 'act2'], ['act1', 'act3', 'act4'], ['act2', 'act4']],
                       clicks=[2,3,2],
                       spend=[800,650,743]),
                  index=[111,111,222])
df

#                 actions  clicks  spend
# 111        [act1, act2]       2    800
# 111  [act1, act3, act4]       3    650
# 222        [act2, act4]       2    743

# Group by index value, apply specified functions to groups within each named column.
#  See https://stackoverflow.com/questions/406121/flattening-a-shallow-list-in-python
df.groupby(level=0).agg(dict(actions=lambda x: list(chain.from_iterable(x)), clicks="sum", spend="sum"))

#                             actions  clicks  spend
# 111  [act1, act2, act1, act3, act4]       5   1450
# 222                    [act2, act4]       2    743
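As a small stdlib-only illustration of what the lambda does for a single group, the sketch below applies chain.from_iterable() to the two lists that share index 111 in the toy data above. The variable names are chosen for illustration only.

```python
from itertools import chain

# The two 'actions' lists that share index 111 in the toy data above.
group_111 = [["act1", "act2"], ["act1", "act3", "act4"]]

# chain.from_iterable returns a lazy iterator over the inner items...
lazy = chain.from_iterable(group_111)

# ...so list() is needed to materialize a single flat list.
flat = list(lazy)
print(flat)  # ['act1', 'act2', 'act1', 'act3', 'act4']

# Only one level is flattened: a list nested inside an item stays nested.
nested = list(chain.from_iterable([[1, [2, 3]], [4]]))
print(nested)  # [1, [2, 3], 4]
```

Because the iterator is lazy, leaving out list() would store a chain object in the cell rather than the flattened list.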

If actions are strings, then for the "actions" column we might be tempted to use a lambda expression that calls str.join() on each group of strings.

import pandas as pd

# Some toy data.
actions = ['[{"c":"d"}, {"a":"b"}]',
           '[{"c":"d"}, {"a":"b"}], {"e":"f"}',
           '[{"c":"d"}, {"a":"b"}]']
df = pd.DataFrame(dict(actions=actions,
                       clicks=[2,3,2],
                       spend=[800,650,743]),
                  index=[111,111,222])
df

#                                actions  clicks  spend
# 111             [{"c":"d"}, {"a":"b"}]       2    800
# 111  [{"c":"d"}, {"a":"b"}], {"e":"f"}       3    650
# 222             [{"c":"d"}, {"a":"b"}]       2    743

df.groupby(level=0).agg(dict(actions=lambda x: ''.join(x), clicks="sum", spend="sum"))

#                                                actions  clicks  spend
# 111  [{"c":"d"}, {"a":"b"}][{"c":"d"}, {"a":"b"}], ...       5   1450
# 222                             [{"c":"d"}, {"a":"b"}]       2    743

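But this is not quite right, because the "actions" for group 111 look like [...][...] when they should look like [...]. To aggregate these lists correctly, we first need to eval() each cell so that it is interpreted as a list, then aggregate all the lists in the group with chain() as above, and finally use repr() to get a string that represents the aggregated list.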
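A quick stdlib-only check shows why plain concatenation is not enough. This sketch assumes the cells hold valid JSON and therefore uses json.loads() rather than eval(); `cell_a` and `cell_b` are made-up values for illustration.

```python
import json

cell_a = '[{"c":"d"}, {"a":"b"}]'
cell_b = '[{"e":"f"}]'

# Concatenating two serialized lists yields '[...][...]', not one list.
joined = "".join([cell_a, cell_b])
try:
    json.loads(joined)
    joined_is_valid = True
except json.JSONDecodeError:
    joined_is_valid = False
print(joined_is_valid)  # False

# Parsing each cell first and then concatenating gives one real list.
merged = json.loads(cell_a) + json.loads(cell_b)
print(json.dumps(merged))
```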

from itertools import chain

def agg_actions(actions):
    assert isinstance(actions, pd.Series), f"Expected Series, got {type(actions)}"
    # Note: eval() executes arbitrary code, so only use it on trusted data.
    actions_as_list = actions.map(eval)  # Interpret string '[...]' as list [...].
    agg_as_list = list(chain.from_iterable(actions_as_list))  # Aggregate whole series into one list.
    return repr(agg_as_list)  # Return a string representation of the big list.

df.groupby(level=0).agg(dict(actions=agg_actions, clicks="sum", spend="sum"))

#                                                actions  clicks  spend
# 111  [{'c': 'd'}, {'a': 'b'}, {'c': 'd'}, {'a': 'b'...       5   1450
# 222                           [{'c': 'd'}, {'a': 'b'}]       2    743
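The aggregation above concatenates the dicts but does not add up the values per action_type, which is what the question's expected output shows. A minimal stdlib-only sketch of that last step follows. It assumes every cell is valid JSON and every "value" is an integer string; `sum_actions` is a hypothetical helper, not part of pandas or PySpark, and the two cells are trimmed from the question's 56df rows.

```python
import json

def sum_actions(cells):
    """Merge serialized action lists, summing 'value' per 'action_type'."""
    totals = {}  # dicts keep the order in which keys are first seen
    for cell in cells:
        for action in json.loads(cell):
            key = action["action_type"]
            totals[key] = totals.get(key, 0) + int(action["value"])
    # Serialize back to the same shape, with values as strings again.
    merged = [{"action_type": k, "value": str(v)} for k, v in totals.items()]
    return json.dumps(merged)

# Two cells trimmed from the question's rows for id 56df.
cells = [
    '[{"action_type":"key1","value":"58"}, {"action_type":"key2","value":"2"}]',
    '[{"action_type":"key1","value":"3"}, {"action_type":"key8","value":"25"}]',
]
print(sum_actions(cells))
```

A function of this shape takes the cells of one group and returns one string, so it plays the same role as agg_actions; whether it suits a particular pandas or PySpark pipeline is left as an assumption to verify.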
