Converting a large amount of text to UTF-8

2 answers

User

#1 · edited 2024-05-17 11:36:10

Rather than guessing the encoding yourself, let chardet guess it for you:

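import chardet

def read(filename, encoding=None, min_confidence=0.5):
    """Return the contents of 'filename' as str, or as bytes in 'encoding'."""
    with open(filename, "rb") as f:
        raw = f.read()
    # chardet.detect returns a dict with "encoding" and "confidence" keys
    guess = chardet.detect(raw)
    if guess["encoding"] is None or guess["confidence"] < min_confidence:
        # A bare UnicodeDecodeError cannot be raised (it requires five
        # arguments), so raise its parent class with a message instead.
        raise UnicodeError("cannot reliably detect encoding of %r" % filename)
    text = raw.decode(guess["encoding"])  # Python 3 replacement for unicode()
    if encoding is not None:
        text = text.encode(encoding)
    return text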

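User

#2 · edited 2024-05-17 11:36:10

To convert it to UTF-8, you need to know how it is currently encoded. From your description, I'd guess it is one of the Latin-1 variants, ISO 8859-1 or Windows-1252. If so, you can convert it to UTF-8 like this: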

data = b'Copyright \xA9 2012'  # \xA9 is the copyright symbol in Windows-1252

# Convert from Windows-1252 to UTF-8: decode the bytes to str, then encode
encoded = data.decode('Windows-1252').encode('utf-8')

# Prints b'Copyright \xc2\xa9 2012' (the UTF-8 bytes for "Copyright © 2012")
print(encoded)
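
For a large file, the same decode-then-encode step can be applied in chunks so the whole text never has to sit in memory at once. The following is only a minimal Python 3 sketch, assuming the source really is Windows-1252; the function name `convert_file_to_utf8` and the `chunk_size` parameter are hypothetical and not part of the original answer.

```python
def convert_file_to_utf8(src_path, dst_path,
                         src_encoding="windows-1252", chunk_size=1 << 20):
    """Re-encode src_path as UTF-8 into dst_path, reading in chunks.

    Hypothetical helper for illustration; src_encoding is an assumption
    the caller must get right.
    """
    # newline="" turns off newline translation, so line endings are kept as-is
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)  # reads up to chunk_size characters
            if not chunk:
                break
            dst.write(chunk)
```

If the assumed encoding is wrong, or the file contains bytes that the codec does not define, reading will raise UnicodeDecodeError, which is usually preferable to silently writing corrupted text.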
