Some of my columns contain unwanted data. How do I get rid of it?

Posted 2024-09-24 02:21:53


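As you can see in the age and gender columns below, some cells contain data that does not belong there; the values should be either null or a number. Why are the cells getting mixed up with each other, and how can I clean these columns?

As far as I can tell, the root of the problem is the description column: some of its cells show up as empty, or show data with some unremoved spaces, even though they do contain data. As a result, when I read the file, the content of description ends up in the age and gender columns.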

df = sqlContext.read.csv("/FileStore/tables/mtmedical_V6-16623.csv", header=True)
df.show(150)

Output:

+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
|         description|   medical_specialty|                 age|              gender|sample_name (What has been done to patient = Treatment)|       transcription|            keywords|
+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
| A 23-year-old wh...| Allergy / Immuno...|                  23|              female|                                     Allergic Rhinitis |SUBJECTIVE:,  Thi...|allergy / immunol...|
| Consult for lapa...|          Bariatrics|                null|                male|                                    Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
| Consult for lapa...|          Bariatrics|                  42|                male|                                    Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
| 2-D M-Mode. Dopp...| Cardiovascular /...|                null|                null|                                    2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
|  2-D Echocardiogram| Cardiovascular /...|                null|                male|                                    2-D Echocardiogr...|1.  The left vent...|cardiovascular / ...|
| Morbid obesity. ...|          Bariatrics|                  30|                male|                                    Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, gastr...|
| Liposuction of t...|                null|                null|                null|                                                   null|                null|                null|
|", Bariatrics,31,...|       1.  Deformity| right breast rec...|2.  Excess soft t...|                                    anterior abdomen...|3.  Lipodystrophy...|POSTOPERATIVE DIA...|
|  2-D Echocardiogram| Cardiovascular /...|                null|                male|                                    2-D Echocardiogr...|2-D ECHOCARDIOGRA...|cardiovascular / ...|
| Suction-assisted...|          Bariatrics|                null|                male|                                    Lipectomy - Abdo...|PREOPERATIVE DIAG...|bariatrics, lipod...|
| Echocardiogram a...| Cardiovascular /...|                null|                null|                                    2-D Echocardiogr...|DESCRIPTION:,1.  ...|cardiovascular / ...|
| Morbid obesity. ...|          Bariatrics|                  50|                male|                                    Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, morbi...|
| Normal left vent...| Cardiovascular /...|                null|                male|                                           2-D Doppler |2-D STUDY,1. Mild...|cardiovascular / ...|
| Cerebral Angiogr...|           Neurology|                  31|                male|                                      Moyamoya Disease |"CC:, Confusion a...| she was found ""...|

This is what the CSV file looks like


1 answer
User
#1 · Posted 2024-09-24 02:21:53

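One approach is to map over the DataFrame and drop the "bad rows". However, if you are going to receive several CSV files like this one, that will not be a very scalable process.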
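A minimal sketch of the "drop bad rows" idea, written in plain Python so the rule itself is easy to see. It assumes, as the question states, that a valid `age` is either null or a whole number. The accepted `gender` labels (`male`, `female`) are taken from the output shown in the question and are otherwise an assumption, and `is_clean_row` is a hypothetical helper name.

```python
import re


def is_clean_row(age, gender):
    """Return True when age is null or digits only, and gender is null or a known label."""
    # Assumption: these are the only legitimate gender labels,
    # based on the sample output in the question.
    valid_genders = {"male", "female"}

    age_ok = age is None or re.fullmatch(r"[0-9]+", age.strip()) is not None
    gender_ok = gender is None or gender.strip().lower() in valid_genders
    return age_ok and gender_ok


# (age, gender) pairs taken from the output in the question.
rows = [
    ("23", "female"),
    (None, "male"),
    (" right breast rec...", "2.  Excess soft t..."),  # shifted description text
]
clean_rows = [row for row in rows if is_clean_row(*row)]
print(clean_rows)
```

On the Spark side the same predicate could be applied with something like `df.rdd.filter(lambda r: is_clean_row(r["age"], r["gender"]))`. This only hides the symptom, though: the text that was shifted into the wrong columns is discarded rather than parsed correctly.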

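The second approach is to clean the CSV file itself. It looks to me as though the file has incorrect tabs or spaces, which may be what causes the problem.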
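If the file is cleaned before loading, one possible way is to let Python's standard `csv` module parse it (it understands quoted fields that span several lines) and write it back with the line breaks inside fields replaced by spaces. This is a sketch under the assumption that the file is an ordinary comma-separated file with double-quoted fields; `flatten_newlines` and the file names in the comment are hypothetical.

```python
import csv
import io


def flatten_newlines(src, dst):
    """Copy CSV records from src to dst, replacing line breaks inside fields with spaces."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for record in reader:
        writer.writerow([" ".join(field.splitlines()) for field in record])


# In-memory demo: the second field of the data record spans two lines
# and contains a doubled (escaped) quote.
raw = 'description,transcription\n"Consult","HISTORY:\nPatient says ""fine""."\n'
out = io.StringIO()
flatten_newlines(io.StringIO(raw, newline=""), out)
print(out.getvalue())

# With real files (hypothetical names), open both with newline="":
# with open("input.csv", newline="") as src, open("cleaned.csv", "w", newline="") as dst:
#     flatten_newlines(src, dst)
```

After this step every record occupies exactly one physical line, so a line-oriented reader no longer splits a record in the middle of a field.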

Finally, you can try the following:

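// Scala
val df = spark.read
  .option("wholeFile", true)      // pre-release name of multiLine, kept from the original answer
  .option("multiLine", true)      // let a quoted field span several lines
  .option("header", true)         // first line holds the column names
  .option("inferSchema", "true")  // guess column types instead of reading everything as string
  .csv("/FileStore/tables/mtmedical_V6-16623.csv")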

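This makes Spark read text content that contains multiple line breaks as part of a single field instead of starting a new row at each line break, which is probably the problem here.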
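Since the question uses PySpark while the snippet above is Scala, here is a rough Python equivalent. It is a sketch, not a verified fix for this particular file: `read_multiline_csv` is a hypothetical helper, and the `quote`/`escape` settings are an added assumption, based on the doubled quotes (`""`) visible in the output, that the file escapes quotes inside fields by doubling them.

```python
def read_multiline_csv(spark, path):
    """Read a CSV whose quoted fields may contain commas and line breaks.

    `spark` is an existing SparkSession (on Databricks, the predefined `spark`).
    """
    return (
        spark.read
        .option("header", True)
        .option("multiLine", True)   # keep a quoted field with line breaks in one record
        .option("quote", '"')        # Spark's default, stated here for clarity
        .option("escape", '"')       # assumption: quotes inside fields are written as ""
        .option("inferSchema", True)
        .csv(path)
    )


# Usage, with the path from the question:
# df = read_multiline_csv(spark, "/FileStore/tables/mtmedical_V6-16623.csv")
# df.select("age", "gender").show(20)
```

If the age and gender columns still contain free text after this, the remaining rows are worth inspecting in the raw file, since the quoting there may simply be inconsistent.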
