Multiple quote characters when reading a CSV

Posted 2024-10-05 10:51:14


I am trying to read a CSV that looks like this:

import pandas as pd
from io import StringIO

s = """
System Name,System ID,System Type,Flood Source,System Authorization,Rehabilitation Program Status,Responsible Organization
"Napa River, left bank above Tulocay Creek","5305000080","Channel","Napa River","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Napa River, right bank below Napa Creek","5305000050","Levee System","","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Needles "S" Street ","3805030008","Levee System",""S" Street Wash, Dead Mountain HA","USACE Federally constructed, turned over to public sponsor operations and maintenance","Inactive","USACE - Los Angeles District"
"Nevada County Levee 1","1905046000","Levee System","Donner Creek","Locally Constructed, Locally Operated and Maintained","Not Enrolled","California"
"Nevada Levee","7005000873","Levee System","","Other Federal Agency","Not Enrolled","Bureau of Reclamation"
"""

pd.read_csv(StringIO(s))

The problem is that "Needles "S" Street " contains multiple quote characters, which results in a ParserError:

ParserError: Error tokenizing data. C error: Expected 7 fields in line 5, saw 8

I tried this approach, but every attempt at writing my own delimiter ended with a single-column DataFrame. Any ideas?


1 answer
User
#1 · Posted 2024-10-05 10:51:14

A quote inside a quoted field must be escaped by doubling it (""). This line is escaped incorrectly:

"Needles "S" Street ","3805030008","Levee System",""S" Street Wash, Dead Mountain HA",...
         ^^^                                       ^^^

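"S" must be escaped as ""S"" in both places. In the second place it is directly preceded by the field's opening quote, which produces three double quotes in a row, so the whole multiline string has to be delimited with ''' rather than """.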

import pandas as pd
from io import StringIO

s = '''System Name,System ID,System Type,Flood Source,System Authorization,Rehabilitation Program Status,Responsible Organization
"Napa River, left bank above Tulocay Creek","5305000080","Channel","Napa River","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Napa River, right bank below Napa Creek","5305000050","Levee System","","USACE Federally constructed, turned over to public sponsor operations and maintenance","Active","USACE - San Francisco District"
"Needles ""S"" Street ","3805030008","Levee System","""S"" Street Wash, Dead Mountain HA","USACE Federally constructed, turned over to public sponsor operations and maintenance","Inactive","USACE - Los Angeles District"
"Nevada County Levee 1","1905046000","Levee System","Donner Creek","Locally Constructed, Locally Operated and Maintained","Not Enrolled","California"
"Nevada Levee","7005000873","Levee System","","Other Federal Agency","Not Enrolled","Bureau of Reclamation"
'''

df = pd.read_csv(StringIO(s))
print(df)

Output:

                                 System Name  ...        Responsible Organization
0  Napa River, left bank above Tulocay Creek  ...  USACE - San Francisco District
1    Napa River, right bank below Napa Creek  ...  USACE - San Francisco District
2                        Needles "S" Street   ...    USACE - Los Angeles District
3                      Nevada County Levee 1  ...                      California
4                               Nevada Levee  ...           Bureau of Reclamation
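
To see the doubling rule in action, here is a minimal sketch using Python's standard csv module, which applies the same escaping when it writes a field that contains quotes. The row is taken from the question; the two-column layout is a simplification chosen for illustration.

```python
import csv
from io import StringIO

# A value containing literal double quotes, copied from the question's data.
row = ['Needles "S" Street ', '3805030008']

buf = StringIO()
# QUOTE_ALL wraps every field in quotes; embedded quotes are doubled
# because the writer's default is doublequote=True.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["System Name", "System ID"])
writer.writerow(row)

text = buf.getvalue()
print(text)

# Reading the text back recovers the original value, quotes included.
rows = list(csv.reader(StringIO(text)))
print(rows[1])
```

The second line of the written text contains ""S"", which is the form the parser expects.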

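If you cannot easily fix the data at its source, a quick workaround is to encode all the legitimate quotes in the data, remove the illegal ones, and then re-quote the data: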

# 1. replace legal quotes with another symbol to replace back later
s = s.replace('\n"', "\n|").replace('"\n', '|\n')
s = s.replace('",', '|,').replace(',"', ',|')

# 2. remove all illegal quote characters in the data
s = s.replace('"', "")

# 3. re-quote the data
s = s.replace('|', '"')
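
The three steps can be wrapped in a small function, sketched below. The name `requote` and the placeholder check are additions for illustration, not part of the original answer. The sketch assumes that every field is quoted, that the placeholder character does not occur in the data, and that no field contains a quote directly next to a comma.

```python
import csv
from io import StringIO


def requote(text, placeholder="|"):
    """Drop stray quotes inside fields, keeping the quotes that delimit fields.

    Hypothetical helper. Assumes the placeholder is absent from the data and
    that no field contains the sequences `",` or `,"`.
    """
    if placeholder in text:
        raise ValueError(f"placeholder {placeholder!r} already occurs in the data")
    # 1. mark quotes next to a line break or a comma as legitimate
    text = text.replace('\n"', "\n" + placeholder).replace('"\n', placeholder + "\n")
    text = text.replace('",', placeholder + ",").replace(',"', "," + placeholder)
    # 2. remove every remaining (stray) quote
    text = text.replace('"', "")
    # 3. turn the placeholders back into quotes
    return text.replace(placeholder, '"')


# Shortened sample based on the question's data; the leading newline matters
# because step 1 looks for a quote right after a line break.
raw = '''
Name,ID,Source
"Needles "S" Street ","3805030008",""S" Street Wash, Dead Mountain HA"
"Nevada Levee","7005000873",""
'''

rows = list(csv.reader(StringIO(requote(raw).strip())))
print(rows[1])
```

Note that this variant discards the stray quotes, so "S" ends up as a plain S in the parsed values.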

Alternatively, perform only step 1 and change the quote character and delimiter in the pd.read_csv() call. This preserves the illegal quotes:

s = s.replace('\n"', "\n|").replace('"\n', '|\n')
s = s.replace('",', '|,').replace(',"', ',|')

df = pd.read_csv(StringIO(s), delimiter=',', quotechar='|')
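
A quick way to check that the stray quotes survive is to run the same idea on a shortened sample, as in the sketch below. The three-column sample is made up from the question's rows, and `dtype=str` and `keep_default_na=False` are extra options added here so that IDs and empty fields stay as plain strings; they are not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Shortened sample based on the question's data (assumed layout).
raw = '''Name,ID,Source
"Needles "S" Street ","3805030008",""S" Street Wash, Dead Mountain HA"
"Nevada Levee","7005000873",""
'''

# Step 1 only: turn field-delimiting quotes into '|' and leave the rest alone.
marked = raw.replace('\n"', "\n|").replace('"\n', "|\n")
marked = marked.replace('",', "|,").replace(',"', ",|")

# With quotechar='|', the remaining '"' characters are ordinary data.
df = pd.read_csv(StringIO(marked), quotechar="|", dtype=str, keep_default_na=False)
print(df.loc[0, "Name"])
print(df.loc[0, "Source"])
```

As with the three-step variant, this only works if '|' never appears in the data itself.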
