Reading a UTF-8 string from a binary file

2 answers

User

#1 · edited 2024-09-29 19:35:51

Given a file object and the number of characters to read, you can use:

# Build a table mapping each lead byte to its expected follow-byte count.
# Bytes 00-BF have 0 follow bytes; F5-FF are not legal UTF-8.
# C0-DF: 1, E0-EF: 2, and F0-F4: 3 follow bytes.
# F5-FF are left at 0 to minimize reading broken data.
_lead_byte_to_count = [
    1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0
    for i in range(256)
]

def readUTF8(f, count):
    """Read `count` UTF-8 characters from binary file `f`, return them as str."""
    # Written for Python 3: `f` must be opened in binary mode ('rb').
    # Assumes the UTF-8 data is valid; leaves validation to the `.decode()` call.
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        if not lead:  # end of file: stop instead of failing in ord()
            break
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return b''.join(res).decode('utf8')

Test results:

^{pr2}$

User

#2 · edited 2024-09-29 19:35:51

A single character in UTF-8 can take 1, 2, 3, or 4 bytes.
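To make the variable width concrete, here is a small Python 3 sketch that encodes one sample character from each length class and measures the result. The sample characters and the names `samples` and `byte_lengths` are only illustrative choices, not part of the original answer.

```python
# One sample character per UTF-8 length class (illustrative picks):
# U+0041 'A', U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes its UTF-8 encoding occupies.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

This is why a fixed-size `read(n)` on a binary file cannot be relied on to end on a character boundary.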

If you have to read the file byte by byte, you have to follow the UTF-8 encoding rules yourself: http://en.wikipedia.org/wiki/UTF-8
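If byte-by-byte reading really is required, one possible way to avoid hand-coding the encoding rules is to let the standard library's incremental decoder track them. The sketch below assumes Python 3 and a file opened in binary mode; the function name `read_chars_bytewise` is hypothetical and does not come from the thread.

```python
import codecs
import io

def read_chars_bytewise(f, count):
    """Read up to `count` characters from binary file `f`, one byte at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    chars = []
    while len(chars) < count:
        b = f.read(1)
        if not b:
            # End of file: final=True makes a truncated sequence raise
            # UnicodeDecodeError instead of being silently dropped.
            chars.append(decoder.decode(b"", final=True))
            break
        # decode() returns '' until the bytes fed so far complete a character.
        text = decoder.decode(b)
        if text:
            chars.append(text)
    return "".join(chars)

# Usage with an in-memory stand-in for a binary file.
stream = io.BytesIO("a\u20ac\U0001F600z".encode("utf-8"))
first_three = read_chars_bytewise(stream, 3)
```

Because only one byte is consumed per step, the file position afterwards sits exactly after the last character returned, so binary reads can continue from there.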

Most of the time, you can simply set the encoding to utf-8 and read the input stream.

You do not need to care how many bytes you have read.
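A minimal Python 3 sketch of that approach: wrap the binary stream in a text layer with the encoding set, then ask for characters rather than bytes. The `io.BytesIO` object here is just a stand-in for a real file; for a file on disk, `open(path, encoding="utf-8")` gives the same kind of text stream.

```python
import io

# Stand-in for a file opened with open(path, "rb").
raw = io.BytesIO("h\u00e9llo w\u00f6rld".encode("utf-8"))

# The text layer decodes UTF-8, so read(n) counts characters, not bytes.
text = io.TextIOWrapper(raw, encoding="utf-8")
first_five = text.read(5)
rest = text.read()
```

One caveat: the text wrapper reads ahead from the underlying stream in chunks, so this approach fits best when the whole stream is text. If UTF-8 strings are interleaved with other binary fields, the position of the raw stream after a text read should not be relied on.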
