Convert a CSV dict column into rows in PySpark

Posted 2024-05-02 04:28:38


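My CSV file contains two columns:

  1. id
  2. cbgs (a dictionary of key-value pairs, wrapped in "")

The sample CSV data looks like this when opened in Notepad. Cell B2 contains the JSON key-value pairs as a string.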

id,cbgs
sg:bd1f26e681264baaa4b44083891c886a,"{""060372623011"":166,""060372655203"":70,""060377019021"":34}"
sg:04c7f777f01c4c75bbd9e43180ce811f,"{""060372073012"":7}"
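To make the quoting in the sample concrete, here is a minimal sketch that parses it with Python's standard csv module (plain Python, not Spark). The names SAMPLE and read_sample are made up for this illustration. It shows that a doubled quote inside a quoted field stands for one literal quote, so the cbgs cell is an ordinary JSON string once the CSV layer is decoded.

```python
import csv
import io

# The sample file from the question, one record per line.
SAMPLE = (
    'id,cbgs\n'
    'sg:bd1f26e681264baaa4b44083891c886a,'
    '"{""060372623011"":166,""060372655203"":70,""060377019021"":34}"\n'
    'sg:04c7f777f01c4c75bbd9e43180ce811f,"{""060372073012"":7}"\n'
)


def read_sample(text):
    """Parse CSV text; a doubled quote inside a quoted field becomes one quote."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_sample(SAMPLE)
for row in rows:
    # The commas inside the quoted field do not split it into extra columns.
    print(row["id"], row["cbgs"])
```

Whatever reads the file has to apply the same unescaping before the cbgs value can be treated as JSON.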

I am trying to convert it into the following:

id,cbgs,value
sg:bd1f26e681264baaa4b44083891c886a,060372623011,166
sg:bd1f26e681264baaa4b44083891c886a,060372655203,70
sg:bd1f26e681264baaa4b44083891c886a,060377019021,34
sg:04c7f777f01c4c75bbd9e43180ce811f,060372073012,7

What I have tried

1. Attempt 1

^{pr2}$

Error msg:

cannot resolve 'item' given input columns: [id, cbgs, recom_item, recom_cnt];;

Following DrChess's suggestion, I tried the code below, but the output was empty.

fifa_df.withColumn("cbgs", F.from_json("cbgs", T.MapType(T.StringType(), T.IntegerType()))).select("id", F.explode(["visitor_home_cbgs"]).alias('cbgs', 'value')).show()






+------------------+----+-----+
|safegraph_place_id|cbgs|value|
+------------------+----+-----+
+------------------+----+-----+

2 Answers

First you need to parse the JSON into a Map<String, Integer>, then explode the map. You can do it like this:

import pyspark.sql.types as T
import pyspark.sql.functions as F

...

df2.withColumn("cbgs", F.from_json("cbgs", T.MapType(T.StringType(), T.IntegerType()))).select("id", F.explode("cbgs").alias('cbgs', 'value')).show()
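The parse-then-explode idea can be tried without a cluster. The sketch below is plain Python, not Spark, and parse_map, explode_rows and the bad sample are hypothetical names and data made up for illustration. It also models one possible reason for an empty result: in Spark, from_json returns null for a string it cannot parse (in its default mode), and explode emits no rows for a null map. Whether that is what happened in the question is an assumption, since the thread does not confirm it.

```python
import json


def parse_map(text):
    """Return the JSON object as a dict, or None if the text is not a JSON object."""
    try:
        value = json.loads(text)
    except (TypeError, ValueError):
        return None
    return value if isinstance(value, dict) else None


def explode_rows(records):
    """Yield one (id, key, value) tuple per map entry; unparseable records yield nothing."""
    for record_id, text in records:
        mapping = parse_map(text)
        if mapping is None:
            continue
        for key, value in mapping.items():
            yield record_id, key, value


good = [
    ("sg:bd1f26e681264baaa4b44083891c886a",
     '{"060372623011":166,"060372655203":70,"060377019021":34}'),
    ("sg:04c7f777f01c4c75bbd9e43180ce811f", '{"060372073012":7}'),
]
# Hypothetical mis-read input: the doubled quotes were never unescaped.
bad = [("sg:04c7f777f01c4c75bbd9e43180ce811f", '{""060372073012"":7}')]

print(list(explode_rows(good)))  # four (id, key, value) tuples
print(list(explode_rows(bad)))   # no rows at all
```

If the exploded result is empty, it may be worth checking whether the cbgs column holds valid JSON after the CSV is read, before looking at the explode step.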

Here is the approach I followed. It uses only string-processing operations and does not involve handling complex data types.

  1. Read the source CSV file with the escape option set to ":
df=spark.read.format('csv').option('header','True').option('escape','"')

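+-----------------------------------+--------------------------------------------------------+
|id                                 |cbgs                                                    |
+-----------------------------------+--------------------------------------------------------+
|sg:bd1f26e681264baaa4b44083891c886a|{"060372623011":166,"060372655203":70,"060377019021":34}|
|sg:04c7f777f01c4c75bbd9e43180ce811f|{"060372073012":7}                                      |
+-----------------------------------+--------------------------------------------------------+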
  2. The second column is loaded as a string rather than a map. Now split it on commas:
df=df.withColumn('cbgs',split(df['cbgs'],','))
^{pr2}$

  3. Then explode the resulting array.

df=df.withColumn('cbgs',explode(df['cbgs']))

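+-----------------------------------+-------------------+
|id                                 |cbgs               |
+-----------------------------------+-------------------+
|sg:bd1f26e681264baaa4b44083891c886a|{"060372623011":166|
|sg:bd1f26e681264baaa4b44083891c886a|"060372655203":70  |
|sg:bd1f26e681264baaa4b44083891c886a|"060377019021":34} |
|sg:04c7f777f01c4c75bbd9e43180ce811f|{"060372073012":7} |
+-----------------------------------+-------------------+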
  4. Use a regex to extract the key and the value from the cbgs column ^{cd8}
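+-----------------------------------+------------+-----+
|id                                 |cbgs        |value|
+-----------------------------------+------------+-----+
|sg:bd1f26e681264baaa4b44083891c886a|060372623011|166  |
|sg:bd1f26e681264baaa4b44083891c886a|060372655203|70   |
|sg:bd1f26e681264baaa4b44083891c886a|060377019021|34   |
|sg:04c7f777f01c4c75bbd9e43180ce811f|060372073012|7    |
+-----------------------------------+------------+-----+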
  5. Write the result to CSV.
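The split, explode and regex steps above can be mimicked in plain Python to see what each one does to a single cbgs string. This is only a sketch: the answer's own regular expression is not preserved here, so the pattern PAIR and the helper split_and_extract are assumptions made up for illustration. In PySpark the corresponding function would be regexp_extract.

```python
import re

# Hypothetical pattern: optional brace, quoted key, colon, integer value, optional brace.
PAIR = re.compile(r'^\{?"([^"]+)":(\d+)\}?$')


def split_and_extract(record_id, text):
    """Split the JSON-like string on commas, then pull key and value from each piece."""
    rows = []
    for piece in text.split(","):
        match = PAIR.match(piece.strip())
        if match:
            rows.append((record_id, match.group(1), int(match.group(2))))
    return rows


print(split_and_extract("sg:04c7f777f01c4c75bbd9e43180ce811f", '{"060372073012":7}'))
```

Splitting on commas only works because neither the keys nor the values in this data contain a comma. For less regular JSON, parsing the string as a map is the safer route.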
