当前的spark数据帧在一列的单元格级别具有CSV值,我试图将其分解为新列。示例数据帧
a_id features
1 2020 "a","b","c","d","constant1","1","0.1","aa"
2 2021 "a","b","c","d","constant2","1","0.2","ab"
3 2022 "a","b","c","d","constant3","1","0.3","ac","constant3","1.1","3.3","acx"
4 2023 "a","b","c","d","constant4","1","0.4","ad"
5 2024 "a","b","c","d","constant5","1","0.5","ae","constant5","1.2","6.3","xwy","a","b","c","d","constant5","2.2","8.3","bunr"
6 2025 "a","b","c","d","constant6","1","0.6","af"
features列有多个csv值,其中(a、b、c、d)作为标题,它们在某些单元格(第3行和第5行)中重复,我只想提取一个标题及其相应的值。预期数据帧的输出如图所示
应用拆分函数之前从链接使用的代码 Here
from pyspark.sql import functions as F
header='"a","b","c","d",'
num_headers = header.count(",")
df.withColumn("features", F.expr(f"replace(features, '{header}')")) \
.withColumn("features", F.expr(f"regexp_extract_all(features, '(([^,]*,?)\\{{{num_headers}}})')")) \
.withColumn("features", F.explode("features"))\
.filter("not features =''") \
.withColumn("features", F.split("features", ",")) \
.withColumn("a", F.expr("features[0]")) \
.withColumn("d", F.expr("features[3]")) \
.groupBy("a_id") \
.agg(F.first("a").alias("a"), F.collect_list("d").alias("d")) \
.show(truncate=False)
如何在不计算标题数量的情况下进行编码,如图所示,如果有可能增加列,我希望避免硬编码(将标题分配给变量)。请对此给出见解
输出火花数据帧
a_id a d
1 2020 constant1 ["aa"]
2 2021 constant2 ["ab"]
3 2022 constant3 ["ac","acx"]
4 2023 constant4 ["ad"]
5 2024 constant5 ["ae","xwy","bunr"]
6 2025 constant6 ["af"]
请在Sheet1 has Input Sheet2 has Output链接中找到我添加到谷歌工作表中的样本数据,以供参考。我希望这些数据有帮助
仔细观察后,值列表没有按行分割,即(使用分隔符
\n
),并且该\n
值位于行的结束值和新行的开始值之间的列表元素之一(例如"\"pir\"\n\"608abc\""
)。CSV可能很棘手,但样本的优点是细胞值在"
中。因此,使用以下步骤来清理、排序并最终将数据转换为所需格式,从而获得所需的结果:NB.我注意到在共享的示例数据集中没有
a
和d
头,因此我将在步骤10中描述如何为任何所需的列执行此操作步骤
features
使用concat_ws
连接成一个字符串,因为使用\n
的行不容易识别每个csv行。这些值使用|
分隔符连接李>features
时,csv被\n
字符使用split
分割成行posexplode
实现的。这会将列表拆分为col
中具有值的行,并共享pos
中的顺序或csv行号。这是在select
中完成的col
(使用F.col("headers") != F.col("col")
)中的行集中删除重复标题,并删除空行((F.length(F.col("col"))>0)
)。这更容易,因为headers
在前面被提取到另一个名为headers
的列中李>pos
被重命名为row_num
,因为这将告知我们正在处理csv中的哪一行posexplode
用于将col
中的单元格/列值拆分为不同的行,因为我们打算以这些行为中心。标题也是split
,因为我们打算将其作为一个列表处理pos
检索每行中每列的关联标题(即单元格值),以索引标题列表,因为标题和单元格值将共享存储在pos
中的相同列号"
字符isin
根据所需的列进行筛选(例如,在Number,car,car_name
上进行筛选)。如果删除此行/筛选器,您将获得共享google工作表中显示的所有列id
和row_num
header
上进行了一次透视,以将每个单元格值放入一列中。使用的透视值是max(`col`)
,它将返回该id
和row_num
组中相应列的旧单元格值谷歌表单中共享的初始数据
代码:
带有过滤器“编号”、“车辆”、“车辆名称”的输出:
不带过滤器的输出:
让我知道这是否适合你
相关问题 更多 >
编程相关推荐