I'm running into a problem with Unicode surrogate encoding in Python (3.4):
>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed
If I'm not mistaken, the Python documentation says:
'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.
So the code should simply reproduce the source bytes (b'\xCC'). Why is the exception raised instead?
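For comparison, here is a minimal sketch of the round trip I expected, using the 'ascii' codec in place of 'utf-16-be' on the encoding side (that substitution is my own, only to show the documented behaviour with a byte-oriented codec). Both codecs decode the stray byte to the same lone surrogate; the variable names are purely illustrative.

```python
# Sketch only: surrogateescape round trip through a byte-oriented codec.
raw = b'\xCC'

# Decoding: the undecodable byte is mapped to the lone surrogate U+DCCC,
# whether the codec is ASCII or UTF-16-BE (where one byte is truncated data).
as_ascii = raw.decode('ascii', 'surrogateescape')
as_utf16 = raw.decode('utf-16-be', 'surrogateescape')
print(ascii(as_ascii), ascii(as_utf16))

# Encoding with ASCII and the same handler restores the original byte.
restored = as_ascii.encode('ascii', 'surrogateescape')
print(restored)
```

So the decoding half behaves the same for both codecs; only the UTF-16 encoding step fails for me.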
This may be related to my second question:
Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.
(from https://docs.python.org/3/library/codecs.html#standard-encodings)
As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. What is the reasoning behind this change?
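To make the distinction I'm asking about concrete, here is a small sketch, assuming Python 3.4 or later. It contrasts a character above U+FFFF, which the encoder still writes as a surrogate pair, with a lone surrogate code point inside a str, which is what the changed behaviour appears to reject. The sample characters and variable names are my own choices.

```python
# Sketch only: surrogate pairs in the output vs. surrogate code points in the input.
ch = '\U0001F600'                 # a code point above U+FFFF
pair = ch.encode('utf-16-be')     # encoded as the pair D83D DE00
print(pair)

lone = '\ud83d'                   # a lone surrogate code point in a str
try:
    lone.encode('utf-16-be')
    rejected = False
except UnicodeEncodeError:
    rejected = True               # strict encoding refuses the lone surrogate
print(rejected)

# The 'surrogatepass' handler writes the surrogate through unchanged.
passed = lone.encode('utf-16-be', 'surrogatepass')
print(passed)
```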