I have a CSV file that I want to load into a DataFrame with PySpark, but I haven't been able to, because some rows contain data with special characters and only about half of the columns are double-quoted. Below is the data and what I have tried so far.
Sample rows:
"ABG090D",2019-03-03 00:00:00.0000000,"A","some Data C\" AB01","Some Data","LOS","NEW",2019-04-11 00:00:00.0000000,"GHYTR","7860973478","0989","A",2019-03-03 00:00:00.0000000,"Y","N","N","N",1,"N","D016619",,"$,$#,&","Y",
"69901",,,,"FGF",89.00,"W",,"N","R","F",5.00,6.00,6.00,9.00,2.00,0,0,"9090",,"N",,,"1","N",,,"F",,2019-03-03 00:00:00.0000000,,,,,"N","A","N","N","N","N","N",,,,,,,"H",,,,,,,,,,"N","A","0","0","0",,0,0,0,0,0,0,0,"N","00","USA",
"C","I",0,,,,"FGF",0,,,"N","UOIU","5",,0,,0,0,,,"878","N",2019-04-11 09:44:00.0000000,"8980909","H",,,,"N","2","T","SomeData",
2020-03-12 09:24:52.0000000
In the data above, the two main problems I'm facing are:
1. `"some Data C\" AB01"` => it contains a backslash (`\`) and a double quote (`"`) as part of the data.
2. `"$,$#,&"` => it contains commas (`,`) as part of the data.
df = (spark.read
      .option("quote", "\"")
      .option("escape", "\\")
      .option("delimiter", ",")
      .option("ignoreLeadingWhiteSpace", "true")
      .csv("/path/file.csv", schema=customSchema))
With the code above I was able to handle `"some Data C\" AB01"`, but the second problem column, `"$,$#,&"`, still causes trouble.
I also tried the answer given in the link below, but it didn't work for me either: How to remove double quotes and extra delimiter(s) with in double quotes of TextQualifier file in Scala
In your case it's best to build your own parser. I wrote a simple example, shown below, that parses the file with a regular expression and stores the values in a `values` list. I hope this approach works for you.
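The answer's original code is not shown, so the following is a minimal sketch of what such a regex-based parser could look like; the pattern, the `parse_line` name, and the escaping rules are my own illustrative assumptions. A field is treated as either a double-quoted string that may contain backslash-escaped characters (e.g. `\"`) or an unquoted run of non-comma characters:

```python
import re

# Either a double-quoted field whose body allows backslash-escaped characters,
# or an unquoted field running up to the next comma.
FIELD = re.compile(r'"((?:\\.|[^"\\])*)"|([^,]*)')

def parse_line(line):
    """Split one CSV line into a list of field values."""
    values, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        quoted, plain = m.group(1), m.group(2)
        if quoted is not None:
            # Unescape backslash-escaped characters inside quoted fields
            values.append(re.sub(r'\\(.)', r'\1', quoted))
        else:
            values.append(plain)
        pos = m.end()
        if pos >= len(line):
            return values
        pos += 1  # skip the comma that separates fields
```

Assuming one logical record per physical line, you could then build the DataFrame from the parsed rows with something like `spark.createDataFrame(spark.sparkContext.textFile("/path/file.csv").map(parse_line), customSchema)`.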