Java Spark row-level error handling: how to get per-row error messages
I have a CSV file that I am loading through Spark. I want to separate the good records from the bad ones, and I also want the row-level error for each bad record.
I specified a schema and can capture `_corrupt_record` like this, but how do I get the error message for each distinct corrupt record?
+--------------------+-----------+----------+--------------------+-------+--------------------+
|service_point_number|energy_type|is_enabled| metadata|testint| _corrupt_record|
+--------------------+-----------+----------+--------------------+-------+--------------------+
| 90453512| E| false|Address1@420#Addr...| 23| null|
| 14802348| G| false|Address1@420#Addr...| 24| null|
| null| null| null| null| null|99944990,E,12,Add...|
| 78377144| E| false| 123| 26| null|
| 25506816| G| false|Address1@420#Addr...| 27| null|
| 48789905| E| true|Address1@420#Addr...| null|48789905,E,true,A...|
| 20283032| E| false|Address1@420#Addr...| 29| null|
| 67311231| G| false|Address1@420#Addr...| 30| null|
| 18240558| G| false|Address1@420#Addr...| 31|18240558,G,false,...|
| 42631153| E| false|Address1@420#Addr...| 32| null|
+--------------------+-----------+----------+--------------------+-------+--------------------+
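Spark's PERMISSIVE parse mode stores the raw offending line in `_corrupt_record`, but it does not record *why* the row failed. One way to recover a per-row message is to collect the corrupt lines (e.g. with `df.filter(col("_corrupt_record").isNotNull())`) and re-validate each one against the expected schema yourself. A minimal plain-Java sketch of that validation step, with no Spark dependency; the field names and type checks are inferred from the table above and are illustrative assumptions:

```java
public class RowErrorCheck {
    // Assumed schema, taken from the table in the question:
    // service_point_number int, energy_type string, is_enabled boolean,
    // metadata string, testint int
    static final int EXPECTED_FIELDS = 5;

    // Returns null for a row that matches the schema,
    // otherwise a human-readable error message for that row.
    static String validate(String line) {
        String[] f = line.split(",", -1);
        if (f.length != EXPECTED_FIELDS)
            return "expected " + EXPECTED_FIELDS + " fields but found " + f.length;
        try {
            Integer.parseInt(f[0]);
        } catch (NumberFormatException e) {
            return "service_point_number is not an int: '" + f[0] + "'";
        }
        if (!f[2].equals("true") && !f[2].equals("false"))
            return "is_enabled is not a boolean: '" + f[2] + "'";
        try {
            Integer.parseInt(f[4]);
        } catch (NumberFormatException e) {
            return "testint is not an int: '" + f[4] + "'";
        }
        return null;
    }

    public static void main(String[] args) {
        String[] rows = {
            "90453512,E,false,Address1@420#Addr,23", // good row
            "99944990,E,12,Address1@420#Addr,24",    // bad boolean column
            "48789905,E,true,Address1@420#Addr"      // missing a field
        };
        for (String r : rows) {
            String err = validate(r);
            System.out.println(err == null ? "OK: " + r : "BAD (" + err + "): " + r);
        }
    }
}
```

Running `validate` over the non-null `_corrupt_record` values yields a message per bad row, which can then be joined back to the original data.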
# Answer 1
The spark-csv package integrated into Spark 2.x provides this: https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
See the CSV example in "Possible to put records that aren't same length as header records to bad_record directory".
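Note that the `badRecordsPath` option described on that page is Databricks-specific. On open-source Spark, a similar bad-record directory can be approximated by routing rows whose field count does not match the header to a separate file, as the linked question describes. A minimal plain-Java sketch, assuming a simple comma-split is sufficient (real CSV may need a quoting-aware parser):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BadRecordSplit {
    // Partitions raw CSV lines into "good" and "bad" buckets by field count,
    // mimicking what a bad-records directory would hold.
    static Map<String, List<String>> split(List<String> lines, int expectedFields) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("good", new ArrayList<>());
        out.put("bad", new ArrayList<>());
        for (String l : lines)
            out.get(l.split(",", -1).length == expectedFields ? "good" : "bad").add(l);
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a,b,c", "a,b", "x,y,z");
        Map<String, List<String>> buckets = split(lines, 3);
        // The "bad" bucket would then be written to its own directory/file.
        System.out.println(buckets.get("good").size() + " good, "
                + buckets.get("bad").size() + " bad");
    }
}
```

Each bucket can then be written out separately, so bad rows end up in their own location much like `badRecordsPath` does on Databricks.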