Handling unwanted line breaks with read_csv in Pandas

2 Answers

User

#1 · edited 2024-09-20 22:54:34

You can do some preprocessing to get rid of the unwanted line breaks. Below is the example I tested.

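import fileinput

# Pass 1: replace every newline with a '^' marker, which joins the file into one line.
# This assumes '^' never occurs in the data itself.
with fileinput.FileInput('input.csv', inplace=True, backup='.orig.bak') as file:
    for line in file:
        print(line.replace('\n','^'), end='')

# Pass 2: drop the markers that sit right before a '~', i.e. the unwanted breaks.
with fileinput.FileInput('input.csv', inplace=True, backup='.1.bak') as file:
    for line in file:
        print(line.replace('^~','~'), end='')

# Pass 3: turn the remaining markers back into real newlines.
with fileinput.FileInput('input.csv', inplace=True, backup='.2.bak') as file:
    for line in file:
        print(line.replace('^','\n'), end='')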
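The same three-step idea can also be tried in memory, without rewriting input.csv. This is only a rough sketch: the function name `merge_broken_lines`, the sample data, and the assumption that every unwanted break falls directly before a `~` delimiter are illustrative and not taken from the answer above.

```python
import io

def merge_broken_lines(text, delim="~", marker="^"):
    """Remove line breaks that fall directly before a delimiter."""
    if marker in text:
        # The marker must be unused, otherwise real data would be altered.
        raise ValueError("marker character already occurs in the data")
    flat = text.replace("\n", marker)           # step 1: newlines -> marker
    flat = flat.replace(marker + delim, delim)  # step 2: drop breaks before a delimiter
    return flat.replace(marker, "\n")           # step 3: remaining markers -> newlines

# Hypothetical sample: the second record is split across two lines.
raw = "id~name~city\n1~Alice\n~Paris\n2~Bob~Rome\n"
fixed = merge_broken_lines(raw)
buffer = io.StringIO(fixed)  # file-like object that could be handed to pandas.read_csv
print(fixed)
```

With pandas installed, something like `pd.read_csv(buffer, sep="~")` should then see one record per line. Working on a string keeps the original file untouched, at the cost of holding the whole file in memory.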

User

#2 · edited 2024-09-20 22:54:34

The right way is to fix the file when it is created. If that is not possible, you can preprocess the file or use a wrapper.

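Below is a solution using a byte-level wrapper that combines lines until it has the correct number of delimiters. I work at the byte level to make use of the classes of the io module and add as little code of my own as possible: a RawIOBase subclass reads lines from the underlying byte file object and combines them to get the expected number of delimiters (only readinto and readable are overridden).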

import io

class csv_wrapper(io.RawIOBase):
    def __init__(self, base, delim):
        self.fd = base           # underlying (byte) file object
        self.nfields = None
        self.delim = ord(delim)  # code of the delimiter (passed as a character)
        self.numl = 0            # number of line for error processing
        self._getline()          # load and process the header line
    def _nfields(self):
        # number of delimiters in current line
        return self.line.count(self.delim)

    def _getline(self):
        while True:
            # loads a new line in the internal buffer
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:           # store number of delims if not known
                self.nfields = self._nfields()
            else:
                while self.nfields > self._nfields():  # optionally combine lines
                    self.line = self.line.rstrip() + next(self.fd)
                    self.numl += 1
            if self.nfields != self._nfields():        # too many delimiters here...
                print("Too many fields line {}".format(self.numl))
                continue               # ignore the offending line and proceed
            self.index = 0                             # reset line pointers
            self.linesize = len(self.line)
            break
    def readinto(self, b):
        if len(b) == 0: return 0
        if self.index == self.linesize:            # if current buffer is exhausted
            try:                                   # read a new one
                self._getline()
            except StopIteration:
                return 0                           # end of file
        # copy as many bytes as fit into the passed buffer
        n = min(len(b), self.linesize - self.index)
        b[:n] = self.line[self.index:self.index + n]
        self.index += n
        return n                                   # number of bytes actually stored
    def readable(self):
        return True

You can then change your code to read the file through this wrapper instead of opening it directly.
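A minimal sketch of what that change might look like, assuming a `~`-delimited input (the delimiter seen in the first answer). The wrapper is repeated in condensed form only so the example runs on its own. The sample bytes and the names `sample`, `raw`, `text` and `rows` are illustrative and not from the original answer. The standard csv module stands in for pandas here.

```python
import csv
import io

class csv_wrapper(io.RawIOBase):
    """Condensed copy of the wrapper: merges lines until the delimiter count matches."""
    def __init__(self, base, delim):
        self.fd = base                  # underlying byte file object
        self.delim = ord(delim)         # delimiter as a byte value
        self.nfields = None             # delimiter count, taken from the header line
        self.numl = 0
        self._getline()

    def _getline(self):
        while True:
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:
                self.nfields = self.line.count(self.delim)
            while self.line.count(self.delim) < self.nfields:
                self.line = self.line.rstrip() + next(self.fd)
                self.numl += 1
            if self.line.count(self.delim) != self.nfields:
                print("Too many fields line {}".format(self.numl))
                continue                # skip the offending line
            self.index = 0
            return

    def readinto(self, b):
        if self.index == len(self.line):
            try:
                self._getline()
            except StopIteration:
                return 0                # end of input
        n = min(len(b), len(self.line) - self.index)
        b[:n] = self.line[self.index:self.index + n]
        self.index += n
        return n

    def readable(self):
        return True

# Hypothetical data: the second record is broken across two lines.
sample = b"a~b~c\n1~2\n~3\n4~5~6\n"
raw = csv_wrapper(io.BytesIO(sample), "~")
# BufferedReader and TextIOWrapper turn the raw byte stream into a text file object.
text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8")
rows = list(csv.reader(text, delimiter="~"))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3'], ['4', '5', '6']]
```

For a real file you would open it in binary mode, for example `open('input.csv', 'rb')`, in place of the `io.BytesIO` object. Since pandas.read_csv accepts file-like objects, the same `text` object could presumably be passed to it together with `sep='~'`.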
