How to treat broken text as data in Python 3.x

Posted 2024-09-28 21:37:40


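My code reads lines from CSV files that contain a mix of ASCII text and octal escape sequences. I am trying to recover the original UTF-8 text, but I am missing something obvious.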

line = "Tom\303\241\305\241 Vala" #Tomáš Vala
print(line)
TomÃ¡Å¡ Vala  #incorrect

If I type the line in by hand at the command line, the result is correct.

But how do I print a line that is already in bytes?

>>> a = "Tom\303\241\305\241 Vala"
>>> print(a)
TomÃ¡Å¡ Vala  #incorrect

>>> b = bytes(a, 'utf-8')
>>> b.decode('utf-8')
'TomÃ¡Å¡ Vala' #incorrect

1 Answer

Answer 1 · posted 2024-09-28 21:37:40

You need to translate all the literal backslash escape sequences. You can do this with a regular expression:

import re

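# a literal backslash followed by three octal digits (0-7), e.g. \303
seq = re.compile(br'\\[0-7]{3}')
# turn the matched digits into the single byte they encode
decode_seq = lambda m: bytes([int(m.group()[1:], 8)])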
def repair(data):
    return seq.sub(decode_seq, data)

This decodes the escapes in a bytes object; decoding the result as UTF-8 then gives the original text.
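
As a minimal sketch of how `repair` might be applied, assume a line that came out of the file in binary mode with the backslashes still literal. The sample value and the `latin-1` round trip for `str` input are assumptions made for illustration, not part of the original answer.

```python
import re

# Literal backslash followed by three octal digits, e.g. \303
seq = re.compile(br'\\[0-7]{3}')

def repair(data):
    # Replace each escape with the single byte it encodes
    return seq.sub(lambda m: bytes([int(m.group()[1:], 8)]), data)

# A line as it might be read in binary mode (the raw literal keeps the backslashes)
raw_line = br'Tom\303\241\305\241 Vala'
fixed = repair(raw_line).decode('utf-8')
print(fixed)

# If the line was already read as str, go back to bytes first.
# Assumption: the text contains only code points 0-255, which latin-1
# maps one-to-one onto bytes.
text_line = r'Tom\303\241\305\241 Vala'
fixed_from_str = repair(text_line.encode('latin-1')).decode('utf-8')
```

Repairing before decoding matters here: each escape stands for one byte of a multi-byte UTF-8 character, so the bytes have to be reassembled before the text can be decoded.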

To wrap an existing file, you have to implement an io.BufferedIOBase subclass that translates the bytes as they are read:

import re
from io import BufferedIOBase

class OctetEscapeDecodeWrapper(BufferedIOBase):
    def __init__(self, buffer):
        # we wrap a buffer, not a raw object, so don't use raw here.
        self._buffer = buffer
        self._remainder = b''

    def readable(self):
        return True

    def detach(self):
        result, self._buffer = self._buffer, None
        return result

    def _decode(self, data,
                _seq=re.compile(br'\\[0-7]{3}'),
                _decode=lambda m: bytes([int(m.group()[1:], 8)])):
        # replace each literal \ooo escape (three octal digits) with its byte
        return _seq.sub(_decode, data)

    def read1(self, size=-1):
        chunk = self._buffer.read1(size)
        self._remainder, data = b'', self._remainder + chunk
        trail = data.rfind(b'\\', -3)
        # 48..55 are the ASCII codes of the octal digits '0'..'7'
        if chunk and trail > -1 and all(48 <= data[i] <= 55 for i in range(trail + 1, len(data))):
            # data ends in \dd, \d or a lone \; retain it until the next read
            # so we can decode then (at EOF there is nothing left to wait for)
            self._remainder, data = data[trail:], data[:trail]
        return self._decode(data)

    read = read1

    def readinto1(self, b):
        data = self.read1(len(b))
        b[:len(data)] = data
        return len(data)

    readinto = readinto1
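
The part of `read1` that is easiest to get wrong is the handling of an escape that is cut in half by a chunk boundary. The following standalone sketch isolates just that step; `split_incomplete_escape` is a hypothetical helper name used for illustration and does not appear in the class above.

```python
def split_incomplete_escape(data):
    """Return (ready, held_back) for one chunk of bytes."""
    # Only the last three bytes can start an unfinished escape; a backslash
    # any earlier already has room for its three digits inside this chunk.
    trail = data.rfind(b'\\', -3)
    if trail > -1 and all(data[i] in b'01234567'
                          for i in range(trail + 1, len(data))):
        # The chunk ends in \dd, \d or a lone backslash: hold that part back
        return data[:trail], data[trail:]
    return data, b''

# The escape \241 is split across two chunks
ready, held = split_incomplete_escape(br'Tom\303\2')
# The held-back bytes are prepended to the next chunk before decoding
next_chunk = held + br'41 Vala'
```

Here `ready` still contains the complete `\303` escape and can be decoded immediately, while the two held-back bytes only become a complete escape once the next chunk arrives.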

This can be used to wrap an existing binary file so that the data is decoded on the fly:

import csv
from io import TextIOWrapper

with open(path_to_file, 'rb') as binary:
    text = TextIOWrapper(OctetEscapeDecodeWrapper(binary), encoding='utf8')
    reader = csv.reader(text)
    for row in reader:
        ...  # process each row here

Demo:

>>> from io import BytesIO, TextIOWrapper
>>> sample = BytesIO(br'Tom\303\241\305\241 Vala, V\303\241lec, 1.1.1984,')
>>> b = OctetEscapeDecodeWrapper(sample)
>>> t = TextIOWrapper(b, encoding='utf8')
>>> import csv
>>> next(csv.reader(t))
['Tomáš Vala', ' Válec', ' 1.1.1984', '']
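
If a full file wrapper is more than you need, a lighter alternative is to repair the file line by line and hand the decoded lines to `csv.reader`, which accepts any iterable of strings. This is a different technique from the wrapper class above, sketched under the assumption that the file can be iterated in binary mode; `repaired_lines` is a hypothetical helper name.

```python
import csv
import io
import re

_escape = re.compile(br'\\[0-7]{3}')

def repaired_lines(binary_file, encoding='utf-8'):
    """Yield decoded text lines with octal escapes replaced by their bytes."""
    for raw in binary_file:
        # An escape never contains a newline, so it cannot be split across lines
        fixed = _escape.sub(lambda m: bytes([int(m.group()[1:], 8)]), raw)
        yield fixed.decode(encoding)

# In-memory stand-in for a file opened with mode 'rb'
sample = io.BytesIO(br'Tom\303\241\305\241 Vala, V\303\241lec, 1.1.1984,' + b'\n')
rows = list(csv.reader(repaired_lines(sample)))
print(rows)
```

Because iteration is line based, no remainder buffer is needed, at the cost of not offering a general file-like object that other code can read from.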
