Answers to the question: Troubleshooting invalid rows in a CSV

Troubleshooting invalid rows in a CSV


I'm working with a very large CSV file (close to 6 GB) that is full of errors. For example, suppose I have the following CSV file/table:

<pre><code>+------------+-------------+------------+
| ID         | Date        | String     |
+------------+-------------+------------+
| 123456     | 09-20-2019  | ABCDEFG    |
| 123abc456  | 10-30-2019  | HIJKLMN    |
| 7891011    | jdqhouehwf  | OPQRSTU    |
| 1010101    | 03-15-2018  | 8473737    |
| 4823.00    | 02-11-2015  | VWXYZ      |
| 2348813.0  | 01-23-2016  | BAZ        |
+------------+-------------+------------+
</code></pre>

or:

<pre><code>"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
</code></pre>

I'd like a good way to troubleshoot and repair the file. With pandas, I can read the file in:

<pre><code>import pandas as pd

df = pd.read_csv(inputfile)
</code></pre>

pandas always complains: <code>sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False</code>

So I need to clean up each column. But because the file is so large, I can't just print the whole table to the screen and expect to read it. I'd like a simple way to take a column and check whether its values conform to a type. If possible, I also need a way to delete bad rows and/or convert rows to the correct format. In the end, I want the file to look like this (minus the inline comments):

<pre><code>"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
# 123abc456,"10-30-2019","HIJKLMN" was deleted because the ID wasn't a number
# 7891011,"jdqhouehwf","OPQRSTU" was deleted because the data was not a date
1010101,"03-15-2018","8473737" # The last number could be converted to string
4823,"02-11-2015","VWXYZ" # The first number could be converted to integer
2348813,"01-23-2016","BAZ" # The ID number could be converted to int
</code></pre>
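One possible way to approach the checks described above, as a minimal sketch rather than a definitive fix: it uses the standard-library `csv` module instead of pandas, so the file is streamed row by row and never loaded into memory whole. The helper names (`clean_id`, `is_valid_date`, `clean_csv`) are hypothetical, and the sketch assumes the three column names from the example and that dates are always written as `MM-DD-YYYY`.

```python
import csv
import io
from datetime import datetime


def clean_id(value):
    """Return the ID as an int, or None if it is not a whole number."""
    text = (value or "").strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)  # accepts forms such as "4823.00"
    except ValueError:
        return None
    return int(number) if number.is_integer() else None


def is_valid_date(value, fmt="%m-%d-%Y"):
    """Check a date string against an assumed MM-DD-YYYY format."""
    try:
        datetime.strptime(value or "", fmt)
    except ValueError:
        return False
    return True


def clean_csv(src, dst):
    """Copy valid rows from src to dst; return a list of rejected rows."""
    reader = csv.DictReader(src)
    # QUOTE_NONNUMERIC quotes strings and leaves integer IDs unquoted.
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(reader.fieldnames)
    rejected = []
    for row in reader:
        row_id = clean_id(row["ID"])
        if row_id is None:
            rejected.append((reader.line_num, "ID is not a whole number", row))
            continue
        if not is_valid_date(row["Date"]):
            rejected.append((reader.line_num, "Date is not MM-DD-YYYY", row))
            continue
        # Everything read by csv is already a string, so "String" needs no cast.
        writer.writerow([row_id, row["Date"], row["String"] or ""])
    return rejected


# Small in-memory demo using the sample data from the question.
SAMPLE = '''"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
'''

cleaned = io.StringIO()
rejected = clean_csv(io.StringIO(SAMPLE), cleaned)
print(cleaned.getvalue())
for line_number, reason, row in rejected:
    print(line_number, reason, row)
```

For the real file, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, one for reading and one for writing. Collecting the rejected rows together with their line numbers and reasons gives a short list to inspect, rather than having to print the whole table.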

Category: Python Q&A


1 answer

Anonymous, 1 day ago

Skilled in: python, mysql, java
