I inherited a few hundred CSV files that I want to import into pandas DataFrames. They are formatted like this:
username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281
To read one into a pandas DataFrame, I tried:
tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)
and got this error:
ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11
I assume that's because there is an unescaped quote inside this field:
ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...
So I tried:
tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)
and then got a new error (I assume because there is a ; inside the field):
Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5
ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
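I can't regenerate these CSV files. What I'd like to know is: how can I preprocess/fix them so that they are correctly formatted (i.e., with the quotes inside fields escaped)? Alternatively, is there a way to read them directly into a DataFrame even with the unescaped quotes?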
I would clean the data first and then read it into pandas. Here is my solution to your current problem.
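As a minimal sketch of this clean-first idea (not necessarily the code this answer originally used), one option is to rely on the fixed column layout instead of the quoting. It assumes that only the `text` column ever contains a stray `;` or `"`, that every record sits on a single line, and that the four columns before `text` and the five after it never contain `;`. The names `split_row`, `repair_csv`, `N_LEFT` and `N_RIGHT` are hypothetical helpers made up for this illustration.

```python
import csv
import io

N_LEFT = 4   # assumed: username, date, retweets, favorites precede `text`
N_RIGHT = 5  # assumed: geo, mentions, hashtags, id, permalink follow `text`


def split_row(line):
    """Split one raw line into 10 fields, treating only `text` as free-form."""
    # Peel the first four fields off the left; everything else stays in `rest`.
    *left, rest = line.rstrip("\r\n").split(";", N_LEFT)
    # Peel the last five fields off the right; what remains is the text field.
    text, *right = rest.rsplit(";", N_RIGHT)
    # Drop the single pair of enclosing quotes, leave inner quotes untouched.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # Trailing fields such as the id are wrapped in plain quotes; unwrap them.
    right = [field.strip('"') for field in right]
    return left + [text] + right


def repair_csv(src, dst):
    """Copy rows from text stream `src` to `dst` using standard CSV quoting."""
    writer = csv.writer(dst, delimiter=";", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    for line in src:
        if line.strip():  # skip blank lines
            # The writer doubles inner quotes and quotes fields containing `;`.
            writer.writerow(split_row(line))
```

With files opened in text mode, `repair_csv(open(src_path, encoding="utf-8"), open(dst_path, "w", encoding="utf-8", newline=""))` would write a repaired copy (`src_path` and `dst_path` are placeholders). The repaired file should then load with something like `pd.read_csv(dst_path, header=0, sep=';', parse_dates=['date'], dtype={'id': str})`. If any other column can contain `;`, the left/right split no longer holds and the rows would need a different repair.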
Edit:
This replaces the double quotes that fall between the ; delimiters (based on this answer).
Original: