Parsing JSON strings from a PySpark DataFrame

from pyspark.sql.functions import *
from pyspark.sql.types import *

input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context", StringType(), True)])  # schema I'm providing

raw_df = spark.read.json(input_path)

# Remove the extra '/' in the data
cleansed_df = raw_df.withColumn("cleansed_value", regexp_replace(raw_df.value, '/', '')).select('cleansed_value')

cleansed_df.select(from_json('cleansed_value', schema=schema1)).show(1, truncate=False)

1 answer

User

#1 · posted on 2024-10-04 01:37:41

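Null characters (\u0000) interfere with JSON parsing. You can strip them in the same regexp_replace call that removes the slashes: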

from pyspark.sql import functions as F

df = spark.read.json('path')

df2 = df.withColumn(
    'cleansed_value', 
    F.regexp_replace('value','[\u0000/]','')
).withColumn(
    'parsed', 
    F.from_json('cleansed_value','context string')
)

df2.show(20,0)
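+-----------------------+------------------+------+
|value                  |cleansed_value    |parsed|
+-----------------------+------------------+------+
|/{"context":"data"}|{"context":"data"}|[data]|
+-----------------------+------------------+------+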
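
To see why the cleanup matters without needing a Spark session, here is a minimal plain-Python sketch of the same idea. It uses the standard `re` and `json` modules as stand-ins for `regexp_replace` and `from_json`; the sample string and the helper name `clean_and_parse` are hypothetical, not taken from the original data.

```python
import json
import re


def clean_and_parse(raw):
    """Strip NUL characters and slashes, then parse what is left as JSON.

    Hypothetical helper: a plain-Python analogue of regexp_replace + from_json.
    """
    cleansed = re.sub(r'[\x00/]', '', raw)
    return json.loads(cleansed)


# Hypothetical sample: a stray leading '/' and an embedded NUL character.
raw = '/{"context":"da\x00ta"}'

# Parsing the raw string fails, because of the leading '/' and the NUL.
try:
    json.loads(raw)
    parsed_without_cleaning = True
except json.JSONDecodeError:
    parsed_without_cleaning = False

# After cleaning, the string parses normally.
result = clean_and_parse(raw)
print(parsed_without_cleaning, result)
```

Note that the character class removes every '/' in the string, so this approach is only safe if slashes never appear legitimately inside the JSON values (for example in URLs or paths).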
