How to use PySpark to convert a JSON array whose items can hold several kinds of values into columns of a dataframe

dbutils.fs.put("/tmp/test.json", '''{ "userEmail": "rod@test.com", "parameters": [ { "intValue": "0", "name": "classroom:num_courses_created" }, { "boolValue": true, "name": "accounts:is_disabled" }, { "name": "classroom:role", "stringValue": "student" } ] }''', True)

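from pyspark.sql.functions import explode

# Assumption: the question never shows how testJsonData was created;
# reading the file written above is the most likely step
testJsonData = spark.read.json("/tmp/test.json")

tempDf = testJsonData.select("userEmail", explode("parameters").alias("parameters_exploded"))
explodedColsDf = tempDf.select("userEmail", "parameters_exploded.*")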

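from pyspark.sql.types import IntegerType

# Turn intValue into an Int column
explodedColsDf = explodedColsDf.withColumn("intValue", explodedColsDf.intValue.cast(IntegerType()))
# Pivoting on sum("intValue") only carries the integer parameters;
# boolValue and stringValue are not part of the aggregation
pivotedDf = explodedColsDf.groupBy("userEmail").pivot("name").sum("intValue")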

1 Answer

User
#1 · Posted 2024-05-19 16:35:56

The code below converts the sample JSON provided into a dataframe (without using PySpark).
Import libraries
import numpy as np
import pandas as pd
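Assign variables
# Assumed reconstruction: the original snippet did not survive here. The data
# below uses the JSON-style literals true and false, so it presumably mapped
# them to Python booleans.
true = True
false = False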
Assign the JSON to a variable
# The bare true/false literals rely on the variables assigned in the previous step
data = [
    {
        "userEmail": "rod@test.com",
        "parameters": [
            {"intValue": "0", "name": "classroom:num_courses_created"},
            {"boolValue": true, "name": "accounts:is_disabled"},
            {"name": "classroom:role", "stringValue": "student"}
        ]
    },
    {
        "userEmail": "EMAIL2@test.com",
        "parameters": [
            {"intValue": "1", "name": "classroom:num_courses_created"},
            {"boolValue": false, "name": "accounts:is_disabled"},
            {"name": "classroom:role", "stringValue": "student2"}
        ]
    }
]
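Function to convert a dictionary into a column
def get_col(x):
    # x is one parameter dict, e.g. {"intValue": "0", "name": "classroom:num_courses_created"}
    y = pd.DataFrame(x, index=[0])
    # Use the parameter's name as the column header
    col_name = y.iloc[0]['name']
    y = y.drop(columns=['name'])
    y.columns = [col_name]
    return y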
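To see what the helper returns, it can be called on a single parameter dictionary. This is a small illustration, not part of the original answer: the function is repeated so the snippet runs on its own, and `param` and `col` are made-up names.

```python
import pandas as pd

def get_col(x):
    # Same helper as in the answer, repeated so the snippet is self-contained
    y = pd.DataFrame(x, index=[0])
    col_name = y.iloc[0]['name']
    y = y.drop(columns=['name'])
    y.columns = [col_name]
    return y

# One entry from the "parameters" array of the sample record
param = {"intValue": "0", "name": "classroom:num_courses_created"}
col = get_col(param)
print(col)
```

The result is a one-row dataframe whose only column is named after the parameter and holds its value, which is why the helper only suits parameters that carry exactly one value field next to name.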
Iterate over the JSON list
df = pd.DataFrame()
for item in range(len(data)):
    # Initialize empty dataframe
    trow = pd.DataFrame()
    # One row per entry in "parameters"; userEmail is repeated on every row
    temp = pd.DataFrame(data[item])
    for i in range(temp.shape[0]):
        # Read each row
        x = temp.iloc[i]['parameters']
        trow = pd.concat([trow, get_col(x)], axis=1)
        trow['userEmail'] = temp.iloc[i]['userEmail']
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, trow], ignore_index=True)
# Rearrange columns, drop those that are not needed
df = df[['userEmail', 'classroom:num_courses_created', 'accounts:is_disabled', 'classroom:role']]
Output: (not preserved in this copy)
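For comparison, the same wide table can probably be reached with less looping by flattening the records with pd.json_normalize and then pivoting. This is a sketch rather than the answerer's method. It assumes each parameter sets exactly one of intValue, boolValue and stringValue, and that each of those three keys appears at least once in the data (otherwise the matching column would be missing). The names `long_df`, `wide_df` and `value` are made up here.

```python
import pandas as pd

# Same records as above, written with Python booleans so the snippet runs on its own
data = [
    {
        "userEmail": "rod@test.com",
        "parameters": [
            {"intValue": "0", "name": "classroom:num_courses_created"},
            {"boolValue": True, "name": "accounts:is_disabled"},
            {"name": "classroom:role", "stringValue": "student"},
        ],
    },
    {
        "userEmail": "EMAIL2@test.com",
        "parameters": [
            {"intValue": "1", "name": "classroom:num_courses_created"},
            {"boolValue": False, "name": "accounts:is_disabled"},
            {"name": "classroom:role", "stringValue": "student2"},
        ],
    },
]

# One row per parameter, with userEmail carried along as metadata
long_df = pd.json_normalize(data, record_path="parameters", meta="userEmail")

# Assumption: only one value field is filled per row, so take whichever is present
long_df["value"] = (
    long_df["intValue"]
    .combine_first(long_df["boolValue"])
    .combine_first(long_df["stringValue"])
)

# One row per user, one column per parameter name
wide_df = long_df.pivot(index="userEmail", columns="name", values="value")
print(wide_df)
```

The values keep their original Python types (the counts stay strings, the flags stay booleans), so a cast is still needed wherever numeric columns are expected.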
......................... Previous edit .........................
Convert JSON/nested dictionary to a dataframe
# Note: this earlier version appears to expect `data` to be a single record (a dict),
# as in the question's sample, so that each row of temp holds one parameter dict
temp = pd.DataFrame(data)
# Initialize empty dataframe
df = pd.DataFrame()
for i in range(temp.shape[0]):
    # Read each row
    x = temp.iloc[i]['parameters']
    temp1 = pd.DataFrame([x], columns=list(x.keys()))
    temp1['userEmail'] = temp.iloc[i]['userEmail']
    # Convert nested key:value pairs
    y = x['name'].split(sep=':')
    temp1['name_' + y[0]] = y[1]
    # Combine to dataframe (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, temp1], sort=False)
# Rearrange columns, drop those that are not needed
df = df[['userEmail', 'intValue', 'stringValue', 'boolValue', 'name_classroom', 'name_accounts']]
Output: (not preserved in this copy)
Edit-1: Based on the updated screenshot in the question, the code at the top of this answer should work.