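My current Spark dataframe has CSV values at the cell level in one column, and I am trying to explode them into new columns. Example dataframe: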
a_id features
1 2020 "a","b","c","d","constant1","1","0.1","aa"
2 2021 "a","b","c","d","constant2","1","0.2","ab"
3 2022 "a","b","c","d","constant3","1","0.3","ac","a","b","c","d","constant3","1.1","3.3","acx"
4 2023 "a","b","c","d","constant4","1","0.4","ad"
5 2024 "a","b","c","d","constant5","1","0.5","ae","a","b","c","d","constant5","1.2","6.3","xwy","a","b","c","d","constant5","2.2","8.3","bunr"
6 2025 "a","b","c","d","constant6","1","0.6","af"
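The features column holds multiple CSV values in which a, b, c, d act as headers. The headers repeat in some cells (rows 3 and 5), and I want to extract each header only once, together with its corresponding values. The expected output dataframe is shown below.
Output Spark dataframe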
a_id a d
1 2020 constant1 ["aa"]
2 2021 constant2 ["ab"]
3 2022 constant3 ["ac","acx"]
4 2023 constant4 ["ad"]
5 2024 constant5 ["ae","xwy","bunr"]
6 2025 constant6 ["af"]
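As shown, I want to extract only the a and d headers as new columns, where a is a constant and d has multiple values collected into a list.
Please help with how to do this transformation in PySpark. The dataframe above is a real-time streaming dataframe.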
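Using only PySpark/Spark SQL functions:
split the string into substrings after each header ("a","b","c","d",)
explode the results and remove empty rows
split the results again. Now each CSV value is an element of an array
take the values for a and d
group by a_id
Output: one row per a_id, with a as a single value and d as a list, as in the expected dataframe above.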
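The steps above can be sketched roughly as follows. This is a minimal sketch, not the answer's original code: the function name extract_a_and_d, the constants HEADER_PATTERN and FIELD_PATTERN, and the intermediate columns record and values are hypothetical names chosen for illustration. It assumes every record starts with the exact header "a","b","c","d", and that no value contains a comma or a double quote.

```python
# Regular expressions passed to Spark's split().
# Assumption: every record in a `features` cell is preceded by this exact header.
HEADER_PATTERN = r'"a","b","c","d",'
# Assumption: values never contain a comma themselves.
FIELD_PATTERN = r","


def extract_a_and_d(df):
    """Return one row per a_id with `a` (constant) and `d` (list of values)."""
    # Imported inside the function so the patterns above can be checked
    # without a Spark installation.
    from pyspark.sql import functions as F

    records = (
        df
        # Split the cell after every header and explode: one row per record.
        .withColumn("record", F.explode(F.split("features", HEADER_PATTERN)))
        # The piece in front of the first header is empty; drop it.
        .filter(F.col("record") != "")
        # Strip the double quotes, then split the record into its values.
        .withColumn(
            "values",
            F.split(F.regexp_replace("record", '"', ""), FIELD_PATTERN),
        )
    )

    return (
        records
        # Values follow the header order a, b, c, d: index 0 is a, index 3 is d.
        .select(
            "a_id",
            F.col("values")[0].alias("a"),
            F.col("values")[3].alias("d"),
        )
        # `a` is constant per a_id, so grouping by both keeps it as a plain column.
        .groupBy("a_id", "a")
        .agg(F.collect_list("d").alias("d"))
    )
```

It would be called as result = extract_a_and_d(df). On a streaming dataframe the groupBy is a stateful aggregation, so the query would likely need an output mode such as update or complete rather than the default append.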