How do I read a CSV with multiple quoted sections containing the delimiter in a single field?

import csv
from io import StringIO

text = '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))

3 Answers

User

#1 · edited 2024-10-01 04:45:33

If the structure is always the same, with the separating comma sitting between a digit and a '"', you can use a regular expression:

import re

re.split(r'(?<=[0-9]),(?=")', text)
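As a self-contained illustration of the lookaround split above, here is a minimal sketch run against the string from the question. It assumes, as the answer does, that every separating comma follows a digit and precedes a double quote; the variable name `fields` is just for illustration.

```python
import re

text = (
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,'
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
)

# Split only on commas that are preceded by a digit and followed by a quote.
# The lookbehind and lookahead consume nothing, so the digit and the quote
# stay in the resulting fields.
fields = re.split(r'(?<=[0-9]),(?=")', text)

for field in fields:
    print(field)
```

The commas inside "1,2,4" are left alone because each is followed by a digit or a hyphen-free name fragment rather than a quote.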

User

#2 · edited 2024-10-01 04:45:33

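The data is in a non-standard format, so any solution needs to be tested against the full data set. One possible workaround is to first replace every ," sequence with ;" and then simply split on ; (this assumes ; never appears in the data itself). It needs neither the csv nor the re module: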

tests = [
    '"a,b"-"c,d","a,b"-"c,d"',
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0',
]

for test in tests:
    row = test.replace(',"', ';"').split(';')
    print(len(row), row)

Output:

2 ['"a,b"-"c,d"', '"a,b"-"c,d"']
2 ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0', '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0']
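If the data might already contain a semicolon, the same idea seems to work without a placeholder character: split directly on the two-character sequence ," and put the quote back on every piece after the first. The helper name `split_on_comma_quote` below is made up for this sketch, and it still assumes that ," only ever occurs between fields.

```python
def split_on_comma_quote(line):
    """Split on the sequence ',"' and restore the quote that the split removed.

    Hypothetical helper; assumes ',"' never occurs inside a field.
    """
    parts = line.split(',"')
    # The first piece is untouched; every later piece lost its leading quote.
    return [parts[0]] + ['"' + part for part in parts[1:]]


tests = [
    '"a;b"-"c,d","a,b"-"c,d"',  # contains a semicolon inside a quoted section
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,'
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0',
]

for test in tests:
    row = split_on_comma_quote(test)
    print(len(row), row)
```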

User

#3 · edited 2024-10-01 04:45:33

I'll only answer the first part of your question: the built-in csv module cannot do this.

Looking at the CPython source, the quotechar option is only processed at the start of a field:

    case START_FIELD:
        /* expecting field */
        ...
        else if (c == dialect->quotechar &&
                 dialect->quoting != QUOTE_NONE) {
            /* start quoted field */
            self->state = IN_QUOTED_FIELD;
        }
        ...
        break;

Inside a field, there is no such check:

    case IN_FIELD:
        /* in unquoted field */
        if (c == '\n' || c == '\r' || c == '\0') {
            /* end of line - return [fields] */
            if (parse_save_field(self) < 0)
                return -1;
            self->state = (c == '\0' ? START_RECORD : EAT_CRNL);
        }
        else if (c == dialect->escapechar) {
            /* possible escaped character */
            self->state = ESCAPED_CHAR;
        }
        else if (c == dialect->delimiter) {
            /* save field - wait for new field */
            if (parse_save_field(self) < 0)
                return -1;
            self->state = START_FIELD;
        }
        else {
            /* normal character - save in field */
            if (parse_add_char(self, module_state, c) < 0)
                return -1;
        }
        break;

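The parser does check for quotechar while it is in the IN_QUOTED_FIELD state; however, once it has seen the closing quote and ordinary data follows, it goes back to the IN_FIELD state, meaning we are now in an unquoted field. So this is possible: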

>>> import csv
>>> import io
>>> print(next(csv.reader(io.StringIO('"a,b"cd,e'))))
['a,bcd', 'e']

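But once the parser has passed the end of the initial quoted section, it treats any later quote characters as part of the data. I don't know whether this behavior conforms to any (written or unwritten) CSV specification, or whether it is simply a bug.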
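Since the csv module only honors a quote at the start of a field, one way around it is a small hand-written splitter that toggles a "quoted" flag at every quote character, wherever it appears. The sketch below is not part of the csv module; `split_quote_aware` is a hypothetical helper that keeps the quote characters in the output and does not handle escape characters or newlines inside quotes. It also runs csv.reader on the same string for comparison.

```python
import csv
import io


def split_quote_aware(line, delimiter=",", quotechar='"'):
    """Split line on delimiter, ignoring delimiters inside any quoted section.

    Hypothetical helper, not part of the csv module. Quote characters are
    kept in the output; escape characters and embedded newlines are ignored.
    """
    fields = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == quotechar:
            # Toggle quoted mode wherever the quote appears in the field.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == delimiter and not in_quotes:
            # Unquoted delimiter: the current field is complete.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


text = (
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,'
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
)

csv_row = next(csv.reader(io.StringIO(text)))
custom_row = split_quote_aware(text)

print(len(csv_row), csv_row)
print(len(custom_row), custom_row)
```

With the default dialect, csv.reader breaks the line into more than two fields because the second quoted section of each record is read as plain data, while the hand-written splitter returns the two intended fields. As with the other answers, this should be tested against the full data set before relying on it.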
