Python: handling files with mixed encodings

2 answers

User

Answer 1 · edited 2024-06-28 19:11:54

If you try to decode this string as UTF-8 you will, as you already know, get a UnicodeDecodeError, because the stray cp1252 bytes are not valid UTF-8.

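However, Python's codecs module lets you register a callback to handle encoding/decoding errors with the codecs.register_error function. The callback receives the UnicodeDecodeError as its argument, so you can write a handler that decodes the offending data as cp1252 and then lets decoding of the rest of the string continue as UTF-8.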
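To make the handler contract concrete, here is a minimal Python 3 sketch (the answer itself uses Python 2). The handler name report_and_replace and the sample bytes are made up for illustration. It only shows what the callback receives (exc.object, exc.start, exc.end) and what it has to return: a tuple of replacement text and the position at which decoding should resume.

```python
import codecs

def report_and_replace(exc):
    """Illustrative handler: substitute U+FFFD for each undecodable span."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    bad = exc.object[exc.start:exc.end]  # the bytes the codec rejected
    print("undecodable bytes %r at %d-%d" % (bad, exc.start, exc.end))
    # Return (replacement text, index at which decoding resumes).
    return "\ufffd", exc.end

codecs.register_error("report_and_replace", report_and_replace)

decoded = b"caf\xe9 ok".decode("utf-8", "report_and_replace")
```

The registered name is then passed as the errors argument of decode, exactly like the built-in "replace" or "ignore" handlers.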

In my UTF-8 terminal, I can build a mixed, broken string like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma�� 
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

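I wrote the callback described above and found a catch: even if you increment the position from which decoding resumes by 1, so that it starts at the next character, when that next character is also not valid UTF-8 and is outside range(128), the error is raised again at the first out-of-range(128) character. In other words, the decoder "walks back" when it finds consecutive non-ASCII, non-UTF-8 characters.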

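The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from where the last call left off. In this short example I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):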

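import codecs

# Index of the last byte handled by mixed_decoder; reset to -1 before each decode call.
last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error.object      # the byte string being decoded
    position = unicode_error.start
    # If the decoder walked back, resume just past the last byte already handled.
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)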

On the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
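For readers on Python 3, the same idea can be sketched as follows. This is not the answer's code: the handler name cp1252_fallback is made up for illustration, and the sketch assumes the Python 3 UTF-8 decoder resumes at the position the handler returns, which is why no global state is kept. It is worth checking that assumption on your own interpreter.

```python
import codecs

def cp1252_fallback(exc):
    """Decode the bytes UTF-8 rejected as cp1252, then resume right after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    bad = exc.object[exc.start:exc.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined (e.g. 0x81).
    return bad.decode("cp1252", errors="replace"), exc.end

codecs.register_error("cp1252_fallback", cp1252_fallback)

mixed = "maçã ".encode("utf-8") + "maçã ".encode("cp1252")
result = mixed.decode("utf-8", "cp1252_fallback")
```

One limitation carries over from the original approach: a run of cp1252 bytes that happens to form a valid UTF-8 sequence is decoded as UTF-8, and the handler is never called for it.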

User

Answer 2 · edited 2024-06-28 19:11:54

Thanks to jsbueno, a whack of Google searches and a fair amount of pounding, I solved it this way.

#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
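In Python 3 the equivalent one-liner might look like the sketch below; the sample bytes are invented for illustration. Note that the Python 2 call above names no encoding, so it falls back to the interpreter's default codec (normally ASCII), whereas this sketch decodes explicitly as UTF-8 and only replaces the bytes that are genuinely invalid.

```python
# Hypothetical input: valid ASCII plus three stray cp1252 bytes.
raw = b"caf\xe9 \x93quoted\x94"

# Every undecodable byte becomes U+FFFD, which is then swapped for "?".
text = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")
```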

This version allows a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.

import codecs
replacement = {
   '85' : '...',           # u'\u2026' ... character.
   '96' : '-',             # u'\u2013' en-dash
   '97' : '-',             # u'\u2014' em-dash
   '91' : "'",             # u'\u2018' left single quote
   '92' : "'",             # u'\u2019' right single quote
   '93' : '"',             # u'\u201C' left double quote
   '94' : '"',             # u'\u201D' right double quote
   '95' : "*"              # u'\u2022' bullet
}

# This is more complex but allows the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError.object    # the byte string being decoded
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition   # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")

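Basically, I want to turn it into UTF-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.

It isn't pretty, but it does let me make sense of messed-up data.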
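The `.encode('hex')` call used above exists only in Python 2. A rough Python 3 counterpart, assuming `bytes.hex()` (available since Python 3.5) and using the made-up names REPLACEMENTS and hex_fallback, could look like this:

```python
import codecs

# Same idea as the table above: cp1252 byte (as hex) -> plain ASCII stand-in.
REPLACEMENTS = {
    "85": "...",           # ellipsis
    "96": "-", "97": "-",  # en dash, em dash
    "91": "'", "92": "'",  # single quotes
    "93": '"', "94": '"',  # double quotes
    "95": "*",             # bullet
}

def hex_fallback(exc):
    """Replace known bad bytes from the table; show unknown ones as hex digits."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    bad_hex = exc.object[exc.start:exc.end].hex()
    return REPLACEMENTS.get(bad_hex, bad_hex), exc.end

codecs.register_error("hex_fallback", hex_fallback)

fixed = b"it\x92s \x93ok\x94 caf\xe9".decode("utf-8", "hex_fallback")
```

Bytes missing from the table show up as their hex digits in the output, so they can be spotted and added to the table later.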
