Handling encoding errors in UTF-8 files with Python 3

2 answers

User

Answer 1 · edited 2024-09-25 00:22:25

Here is a solution that works but is not very elegant

# Read the file in as raw bytes
fn = 'bad_chars.txt'
with open(fn, 'rb') as f:
    text = f.read()
print(text)

# Detect bytes outside the ASCII range (iterating over bytes yields ints in Python 3)
has_bad = any(c >= 128 for c in text)
print('Had bad:', has_bad)

# Replace the offending byte sequences, then decode
text = text.replace(b'\xc2\x92', b"'")    # U+0092 -> apostrophe
text = text.replace(b'\xc2\x85', b'...')  # U+0085 -> three dots
text = text.decode('utf-8')
print(text)

This produces the following output

b'# ::snt That\xc2\x92s what we\xc2\x92re with\xc2\x85You\xc2\x92re not sittin\xc2\x92 there in a back alley and sayin\xc2\x92 hey what do you say, five bucks?\n'

Had bad: True

# ::snt That's what we're with...You're not sittin' there in a back alley and sayin' hey what do you say, five bucks?

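The drawback is that I have to find each offending character and write a replace call for it. A table of possible replacements from a similar question is at "efficiently replace bad characters".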
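
To see which characters would need a replace rule, one option is to decode first and look for C1 control characters (U+0080 to U+009F), which rarely belong in ordinary text. This is only a sketch: the function name find_c1_controls and the shortened sample bytes are made up for illustration, and it assumes the input is valid UTF-8.

```python
from collections import Counter


def find_c1_controls(raw: bytes) -> Counter:
    """Count C1 control characters (U+0080..U+009F) in UTF-8 encoded bytes.

    Hypothetical helper: assumes `raw` decodes cleanly as UTF-8.
    """
    text = raw.decode('utf-8')
    return Counter(ch for ch in text if '\x80' <= ch <= '\x9f')


# Shortened version of the sample line from the answer above
sample = b'That\xc2\x92s what we\xc2\x92re with\xc2\x85You'
counts = find_c1_controls(sample)
for ch, n in sorted(counts.items()):
    print(f'U+{ord(ch):04X} x{n}')
# U+0085 x1
# U+0092 x2
```

Unlike the byte >= 128 check, this does not flag legitimate non-ASCII text such as accented letters.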

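User

Answer 2 · edited 2024-09-25 00:22:25

Going by the raw data in your answer, what you have is mojibake caused by double encoding. You need to decode twice to translate it correctly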

>>> s = b'# ::snt That\xc2\x92s what we\xc2\x92re with\xc2\x85You\xc2\x92re not sittin\xc2\x92 there in a back alley and sayin\xc2\x92 hey what do you say, five bucks?\n'
>>> s.decode('utf8').encode('latin1').decode('cp1252')
'# ::snt That’s what we’re with…You’re not sittin’ there in a back alley and sayin’ hey what do you say, five bucks?\n'

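The data really is UTF-8, but once it is decoded to Unicode, the incorrect code points turn out to be byte values from the Windows-1252 code page. .encode('latin1') converts the Unicode code points 1:1 back to bytes, because the latin1 encoding is the first 256 code points of Unicode, and those bytes can then be decoded correctly as Windows-1252.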
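
The double decode can be wrapped in a small helper that falls back to the plain UTF-8 text when the round trip is not possible. This is a sketch under assumptions, not a general repair tool: the name fix_double_encoded is hypothetical, and the fallback relies on the fact that .encode('latin1') fails for code points above U+00FF and that Python's cp1252 codec rejects a few undefined bytes such as 0x81.

```python
def fix_double_encoded(raw: bytes) -> str:
    """Decode UTF-8 bytes, then undo a Windows-1252-read-as-Latin-1 mix-up.

    Hypothetical helper: returns the plain UTF-8 decoding unchanged when the
    latin1/cp1252 round trip does not apply to the text.
    """
    text = raw.decode('utf-8')
    try:
        return text.encode('latin1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# How the bad bytes can arise: a cp1252 byte is misread as latin1,
# and the result is then saved as UTF-8.
broken = '\u2019'.encode('cp1252').decode('latin1').encode('utf-8')
print(broken)  # b'\xc2\x92'

sample = b'That\xc2\x92s what we\xc2\x92re with\xc2\x85You'
fixed = fix_double_encoded(sample)
print(fixed)  # That’s what we’re with…You
```

Text that already contains characters above U+00FF (for example a correct ’) makes the latin1 step fail, so it is returned as decoded rather than being altered.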
