PySpark: mapping (reordering/renaming) columns using a JSON template

Published 2024-08-31 09:05:51


I asked the following question here:

The full text is as follows:

I have a DataFrame like this:

|customer_key|order_id|subtotal|address        |
------------------------------------------------
|12345       |O12356  |123.45  |123 Road Street|
|10986       |945764  |70.00   |634 Road Street|
|32576       |678366  |29.95   |369 Road Street|
|67896       |198266  |837.69  |785 Road Street|

I want to reorder/rename the columns based on the following JSON, which maps each current column name to the desired column name:

{
    "customer_key": "cust_id",
    "order_id": "transaction_id",
    "address": "shipping_address",
    "subtotal": "subtotal"
}

to get this resulting DataFrame:

|cust_id|transaction_id|shipping_address|subtotal|
--------------------------------------------------
|12345  |O12356        |123 Road Street |123.45  |
|10986  |945764        |634 Road Street |70.00   |
|32576  |678366        |369 Road Street |29.95   |
|67896  |198266        |785 Road Street |837.69  |

Is this possible? The order of the columns doesn't matter if that makes it easier.

The key difference is that I am now looking for a way to do this in PySpark rather than in pandas.


3 Answers

You can simply use the following:

new_mapping = {
    "customer_key": "cust_id",
    "order_id": "transaction_id",
    "address": "shipping_address",
    "subtotal": "subtotal",
}

# Rename each column in turn
for key, value in new_mapping.items():
    df = df.withColumnRenamed(key, value)

# Reorder the columns to match the mapping
new_columns = list(new_mapping.values())
df = df.select(*new_columns)

Note: the final column order now depends on the dictionary. In Python 2, dictionaries were unordered, so you would have to use an OrderedDict; in Python 3.7+, dictionaries preserve insertion order.
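The order-dependence above can be checked without a Spark session; a minimal plain-Python sketch (using the mapping from the question) shows that the column list passed to `select` is exactly the dictionary's insertion order:

```python
new_mapping = {
    "customer_key": "cust_id",
    "order_id": "transaction_id",
    "address": "shipping_address",
    "subtotal": "subtotal",
}

# list(values()) follows insertion order on Python 3.7+,
# so this is the column order the final select() produces
new_columns = list(new_mapping.values())
print(new_columns)
# ['cust_id', 'transaction_id', 'shipping_address', 'subtotal']
```

If you want a different final order, reorder the keys in the JSON itself.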

You can use the toDF method:

dct = {
    "customer_key": "cust_id",
    "order_id": "transaction_id",
    "address": "shipping_address",
    "subtotal": "subtotal",
}

df = df.toDF(*[dct[col] for col in df.columns])
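Note that `toDF` assigns names positionally, which is why the lookup iterates over `df.columns` rather than over the dict. A plain-Python sketch (assuming the original column order from the question) of the list that would be passed to `toDF`:

```python
dct = {
    "customer_key": "cust_id",
    "order_id": "transaction_id",
    "address": "shipping_address",
    "subtotal": "subtotal",
}

# df.columns in the original frame (taken from the question's table)
df_columns = ["customer_key", "order_id", "subtotal", "address"]

# toDF() assigns the new names by position, so each existing name
# is looked up in the order it appears in the DataFrame
renamed = [dct[c] for c in df_columns]
print(renamed)
# ['cust_id', 'transaction_id', 'subtotal', 'shipping_address']
```

This approach only renames; the columns stay in their original order, so a follow-up `select` would be needed if you also want to reorder them.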

Use select with aliases:

from pyspark.sql.functions import col

# `mappings` is the JSON dict from the question
select_expr = [col(c).alias(a) for c, a in mappings.items()]
df = df.select(*select_expr)
