Python: decoding a UTF-16 file with a BOM

1 answer

User

#1 · Posted 2024-09-29 21:23:11

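First, you should read the file in binary mode; otherwise things get confusing, because you no longer see the raw bytes that the BOM check depends on.

Then check for the BOM and strip it, since it is part of the file, not part of the actual text.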

import codecs

# Read in binary mode so the BOM bytes arrive untouched.
with open('dbo.chrRaces.Table.sql', 'rb') as f:
    encoded_text = f.read()

bom = codecs.BOM_UTF16_LE                        # see dir(codecs) for the other BOM constants
# Make sure the encoding is what you expect; otherwise you'll get wrong data.
assert encoded_text.startswith(bom)
encoded_text = encoded_text[len(bom):]           # strip away the BOM
decoded_text = encoded_text.decode('utf-16le')   # decode to a Unicode string
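The check-and-strip step generalises to files whose byte order is not known in advance. The following is a minimal sketch, not part of the original answer: the helper name `decode_utf16_with_bom` and the lookup table are hypothetical, and it only handles the two UTF-16 BOMs.

```python
import codecs

# Hypothetical lookup table (an assumption for this sketch):
# each UTF-16 BOM mapped to the codec that matches its byte order.
_UTF16_BOMS = {
    codecs.BOM_UTF16_LE: "utf-16-le",
    codecs.BOM_UTF16_BE: "utf-16-be",
}


def decode_utf16_with_bom(raw: bytes) -> str:
    """Decode UTF-16 bytes, choosing the byte order from the leading BOM."""
    for bom, encoding in _UTF16_BOMS.items():
        if raw.startswith(bom):
            # Strip the BOM, then decode the remaining bytes.
            return raw[len(bom):].decode(encoding)
    # Fail loudly rather than guess the encoding.
    raise ValueError("no UTF-16 BOM found")
```

Python's generic `'utf-16'` codec also reads the BOM and drops it while decoding, so `raw.decode('utf-16')` may be enough in many cases. The explicit check is stricter because it rejects input that has no BOM at all.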

Don't encode (to UTF-8 or anything else) until you have finished all parsing and processing. Do all of that work on Unicode strings.
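One way to picture this is decoding at the input edge, working on `str` in the middle, and encoding only at the output edge. The sketch below is an illustration under assumptions: the function name `convert_utf16le_to_utf8` and the line-ending normalisation are hypothetical stand-ins for whatever processing is actually needed.

```python
import codecs


def convert_utf16le_to_utf8(raw: bytes) -> bytes:
    """Decode once, process as text, encode once at the end."""
    bom = codecs.BOM_UTF16_LE
    if not raw.startswith(bom):
        raise ValueError("expected a UTF-16 LE BOM")

    # Input edge: bytes -> str.
    text = raw[len(bom):].decode("utf-16-le")

    # All processing happens on the Unicode string.
    # Hypothetical step for illustration: normalise Windows line endings.
    text = text.replace("\r\n", "\n")

    # Output edge: str -> bytes, only after processing is finished.
    return text.encode("utf-8")
```

Keeping the encode step last means every intermediate operation sees whole characters rather than byte fragments.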

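Also, passing errors='ignore' to decode is probably a bad idea. Consider which is worse: having your program tell you something is wrong and stop, or having it return wrong data?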
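The difference is easy to see on deliberately damaged input. This is a small sketch, not from the original answer: the sample bytes and the two function names are made up for the demonstration.

```python
# "AB" encoded as UTF-16 LE, with the final byte removed so the
# last code unit is cut in half (hypothetical corrupted input).
damaged = "AB".encode("utf-16-le")[:-1]


def decode_strict(raw: bytes) -> str:
    # errors="strict" is the default: bad input raises UnicodeDecodeError.
    return raw.decode("utf-16-le")


def decode_lossy(raw: bytes) -> str:
    # errors="ignore" silently drops whatever cannot be decoded.
    return raw.decode("utf-16-le", errors="ignore")


try:
    decode_strict(damaged)
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# The lossy result is shorter than the original text, with no warning.
print(repr(decode_lossy(damaged)))
```

With the strict handler the corruption surfaces immediately, while the lossy version returns a plausible-looking string that is missing data.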
