Splitting CSV values inside a cell of a PySpark dataframe into new columns with their values

Published 2024-09-30 06:16:42


The current Spark dataframe has CSV values at the cell level in one column, and I am trying to break them out into new columns. Sample dataframe:

    a_id                                    features
1   2020     "a","b","c","d","constant1","1","0.1","aa"
2   2021     "a","b","c","d","constant2","1","0.2","ab"
3   2022     "a","b","c","d","constant3","1","0.3","ac","constant3","1.1","3.3","acx"
4   2023     "a","b","c","d","constant4","1","0.4","ad"
5   2024     "a","b","c","d","constant5","1","0.5","ae","constant5","1.2","6.3","xwy","a","b","c","d","constant5","2.2","8.3","bunr"
6   2025     "a","b","c","d","constant6","1","0.6","af"

The features column holds multiple csv values with (a, b, c, d) as the header; these repeat in some cells (rows 3 and 5), and I want to extract the header only once along with its corresponding values. The expected output dataframe is shown below.

Code used from the link Here, before applying the split function:

from pyspark.sql import functions as F

header='"a","b","c","d",'
num_headers = header.count(",")

df.withColumn("features", F.expr(f"replace(features, '{header}')")) \
  .withColumn("features", F.expr(f"regexp_extract_all(features, '(([^,]*,?)\\{{{num_headers}}})')")) \
  .withColumn("features", F.explode("features"))\
  .filter("not features =''") \
  .withColumn("features", F.split("features", ",")) \
  .withColumn("a", F.expr("features[0]")) \
  .withColumn("d", F.expr("features[3]")) \
  .groupBy("a_id") \
  .agg(F.first("a").alias("a"), F.collect_list("d").alias("d")) \
  .show(truncate=False)
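The chunking trick in the regex above can be illustrated in plain Python (sample values taken from row 3 of the dataframe): once the header prefix is removed, `(([^,]*,?){4})` captures four comma-separated values at a time, i.e. one record per match.

```python
import re

# Plain-Python illustration of the chunking regex used above: after the
# header prefix '"a","b","c","d",' is removed, (([^,]*,?){4}) captures
# four comma-separated values at a time, i.e. one record per match.
remainder = '"constant3","1","0.3","ac","constant3","1.1","3.3","acx"'
records = [
    m.group(1).rstrip(",")
    for m in re.finditer(r"(([^,]*,?){4})", remainder)
    if m.group(1)  # skip the empty match at the end of the string
]
# records == ['"constant3","1","0.3","ac"', '"constant3","1.1","3.3","acx"']
```

The `{4}` quantifier is exactly what `num_headers` feeds into the f-string, which is why the header's comma count has to match the record width.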

How can this be coded without counting the number of headers? Since columns could be added over time, I would like to avoid hardcoding (assigning the header to a variable). Please share your insights on this.
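One hypothetical way to avoid hardcoding the header (a sketch, not part of the original post): since every row starts with the same header fields while the data values differ between rows, the header can be recovered as the longest common prefix of the value lists across a sample of rows.

```python
# Hypothetical helper: recover the header as the longest common prefix of
# the value lists across rows. This breaks down if all sampled rows happen
# to share their first data value too, so sample several rows, not one pair.
def common_prefix_len(rows):
    n = 0
    for column in zip(*rows):          # walk the fields position by position
        if len(set(column)) != 1:      # first position where rows disagree
            break
        n += 1
    return n

rows = [
    '"a","b","c","d","constant1","1","0.1","aa"'.split(","),
    '"a","b","c","d","constant3","1","0.3","ac","constant3","1.1","3.3","acx"'.split(","),
]
width = common_prefix_len(rows)  # 4
header = rows[0][:width]         # ['"a"', '"b"', '"c"', '"d"']
```

With `width` and `header` derived from the data, the replace/chunk logic above no longer needs the header assigned by hand.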

Expected output Spark dataframe:

    a_id       a        d
1   2020   constant1   ["aa"]
2   2021   constant2   ["ab"]
3   2022   constant3   ["ac","acx"]
4   2023   constant4   ["ad"]
5   2024   constant5   ["ae","xwy","bunr"]
6   2025   constant6   ["af"]

Please find the sample data I added to a Google sheet at the link (Sheet1 has the input, Sheet2 has the output) for reference. I hope it helps.


1 Answer

Answered on 2024-09-30 06:16:42

On closer inspection, the list of values is not split into rows, i.e. by the \n delimiter; instead, a \n sits inside one of the list elements, between the closing value of one row and the opening value of the next (e.g. "\"pir\"\n\"608abc\""). CSVs can be tricky, but this sample has the advantage that the cell values are wrapped in ". The following steps therefore clean, reorder, and finally reshape the data into the desired format to obtain the required result:
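The embedded-newline issue can be reproduced in plain Python (hypothetical values): joining the list elements with a visible delimiter first and then re-splitting on \n is exactly what the concat-then-split trick in the steps below does.

```python
# One list element carries an embedded "\n" between the last value of a row
# and the first value of the next row (hypothetical values). Joining with a
# "|" delimiter and then splitting on "\n" recovers the true csv rows.
features = ['"pir"\n"608abc"', '"ZZZZ-TM"', '"RES"']
joined = "|".join(features)   # like concat_ws("|", "features")
rows = joined.split("\n")     # like split(features, "\n")
# rows == ['"pir"', '"608abc"|"ZZZZ-TM"|"RES"']
```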

NB. I noticed the shared sample dataset has no ad header, so in step 10 I describe how to do this for whichever columns you need.

Steps

  1. The list column features is joined into a single string with concat_ws, because with the embedded \n characters the individual csv rows are hard to identify. The values are joined with a | delimiter.
  2. Once features is a single string, the csv is split into rows on the \n character using split.
  3. The header is extracted as the first element of this list. You stated that every row carries the header and that it repeats; the duplicated headers are dropped later, but this step matters for identifying the header.
  4. The next step is to treat the data as rows, which is achieved with posexplode. This turns the list into rows with the value in col and the order, i.e. the csv row number, in pos. This is done inside a select.
  5. Duplicate headers are removed from the rows in col (using F.col("headers") != F.col("col")), and empty rows are dropped ((F.length(F.col("col")) > 0)). This is easier because the header was extracted earlier into a separate column named headers.
  6. pos is renamed to row_num, since it tells us which csv row we are working on.
  7. In another select, posexplode is used to split the cell/column values in col into separate rows, since we intend to pivot around them. headers is also split, because we want to treat it as a list.
  8. The pos from splitting the cell values is used to index the header list, retrieving the header associated with each column (i.e. cell value) in every row, since a header and its cell value share the same column number stored in pos.
  9. The " characters are stripped from the headers and cell values.
  10. You can filter to the columns you need using isin (e.g. filtering on Number, car, car_name here). If you remove this line/filter, you get all the columns shown in the shared google sheet.
  11. The data is grouped on id and row_num.
  12. A pivot is performed on header to place each cell value in its own column. The pivot aggregate used is max(`col`), which returns the original cell value for the corresponding column within that id and row_num group.
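Steps 7–8 can be sketched in plain Python: once both the header string and a csv row are split on |, the shared position (what posexplode exposes as pos) pairs each value with its header.

```python
# Pairing each cell value with its header by shared position: the plain-
# Python analogue of indexing the split header list with posexplode's pos.
# Values are hypothetical; quotes are stripped as in step 9.
headers = '"Number"|"car"|"car_name"'.split("|")
row = '"608abc"|"ZZZZ-TM"|"RES"'.split("|")
paired = {h.strip('"'): v.strip('"') for h, v in zip(headers, row)}
# paired == {'Number': '608abc', 'car': 'ZZZZ-TM', 'car_name': 'RES'}
```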

The initial data shared in the Google sheet is not reproduced here; see the linked sheet for the full input table.

Code:

from pyspark.sql import functions as F

output_df = (
    # Step 1
    df.withColumn("features",F.concat_ws("|","features"))
    # Step 2
      .withColumn("features",F.split("features","\n"))
    # Step 3
      .withColumn("headers",F.col("features")[0])
    # Step 4
      .select(
          F.col("id"),
          F.posexplode("features"),
          F.col("headers")
      )
    # Step 5
      .where(
          (F.col("headers") != F.col("col")) & 
          (F.length(F.col("col"))>0)
       )
    # Step 6
       .withColumnRenamed("pos","row_num")
    # Step 7
       .select(
           F.col("id"),
           F.col("row_num"),
           F.posexplode(F.split("col", r"\|")),        # raw strings avoid the invalid "\|" escape
           F.split("headers", r"\|").alias("header")
       )
    # Step 8
       .withColumn("header",F.col("header")[F.col("pos")])
    # Step 9
       .withColumn("header",F.regexp_replace("header",'"',""))
       .withColumn("col",F.regexp_replace("col",'"',""))
    # Step 10
       .where(F.col("header").isin(["Number","car","car_name"]))
    # Step 11 
       .groupBy("id","row_num")
    # Step 12
       .pivot("header")
       .agg(
           F.max(F.col("col"))
       )
       .orderBy("id","row_num") # ordering is optional here. Included for answer presentation
)

output_df.show(truncate=False)

Output with the Number, car, car_name filter:

+---+-------+------+---------+--------+
|id |row_num|Number|car      |car_name|
+---+-------+------+---------+--------+
|1  |1      |608abc|ZZZZ-TM  |RES     |
|1  |2      |814abc|TRAC     |TRAC    |
|2  |1      |608abc|ZZZZ-TM  |RES     |
|2  |2      |814abc|TRAC     |TRAC    |
|3  |1      |740abc|TOPPS    |TOPPSPCS|
|3  |2      |814abc|TRAC     |TRAC    |
|4  |1      |205abc|ZZZZ-VERI|TRAC    |
|4  |2      |318abc|TRAC     |TRAC    |
|5  |1      |651abc|ZZZZ-TM  |RES     |
|5  |2      |701abc|OTHERS   |CONS    |
+---+-------+------+---------+--------+

Output without the filter:

|id |row_num|Number|acc|acc_cat|avg_ch|avg_port|car      |car_name|careFai|careInt|csm1|csm2|ctn|day1|day10|day11|day12|day13|day14|day15|day16|day17|day2|day3|day4|day5|day6|day7|day8|day9|ivrFai|ivrInt|last              |net   |ooids|pasC|piTE|piTS|pir|req_h|req_w|retlFai|retlInt|ss_c|ss_i              |total1|total10|total11|total12|total13|total14|total15|total16|total17|total2|total3|total4|total5|total6|total7|total8|total9|
|1  |1      |608abc|   |OTHERS |      |        |ZZZZ-TM  |RES     |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |2431.3522631166666|TM    |     |0   |0   |0   |0  |20   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|1  |2      |814abc|   |OTHERS |      |        |TRAC     |TRAC    |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |                  |OTHERS|     |0   |0   |0   |0  |20   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|2  |1      |608abc|   |OTHERS |      |        |ZZZZ-TM  |RES     |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |2431.3514778000003|TM    |     |0   |0   |0   |0  |20   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|2  |2      |814abc|   |OTHERS |      |        |TRAC     |TRAC    |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |                  |OTHERS|     |0   |0   |0   |0  |20   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|3  |1      |740abc|   |OTHERS |      |        |TOPPS    |TOPPSPCS|0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |7.563553799999999 |TOPPS |     |0   |0   |0   |0  |19   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|3  |2      |814abc|   |OTHERS |      |        |TRAC     |TRAC    |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |                  |OTHERS|     |0   |0   |0   |0  |19   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|4  |1      |205abc|   |OTHERS |      |        |ZZZZ-VERI|TRAC    |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |278.06139585      |VERI  |     |0   |0   |0   |0  |19   |7    |0      |0      |1   |SMART             |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|4  |2      |318abc|   |OTHERS |      |        |TRAC     |TRAC    |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |                  |OTHERS|     |0   |0   |0   |0  |19   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|5  |1      |651abc|   |OTHERS |      |        |ZZZZ-TM  |RES     |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |                  |TM    |     |0   |0   |0   |0  |20   |7    |0      |0      |1   |MOBP/FEATURE PHONE|0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |
|5  |2      |701abc|   |OTHERS |      |        |OTHERS   |CONS    |0      |0      |0   |0   |   |0   |0    |0    |0    |0    |0    |0    |0    |0    |0   |0   |0   |0   |0   |0   |0   |0   |0     |0     |                  |OTHERS|     |0   |0   |0   |0  |20   |7    |0      |0      |    |                  |0     |0      |0      |0      |0      |0      |0      |0      |0      |0     |0     |0     |0     |0     |0     |0     |0     |

Let me know if this works for you.
