Python cannot encode using surrogateescape

Posted 2024-09-28 19:05:48


I am having trouble encoding Unicode surrogates in Python (3.4):

>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed

If I am not mistaken, the Python documentation says:

'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

Given that, the code should simply reproduce the source sequence (b'\xCC'). So why is an exception raised?
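For comparison, the round trip that the documentation describes does work with an ASCII-compatible codec such as utf-8. The following is a minimal sketch; the variable names are illustrative and not part of the original question.

```python
# An undecodable byte (0xCC) appended to plain ASCII text.
raw = b'abc\xCC'

# On decoding, 'surrogateescape' maps the bad byte 0xCC to the lone surrogate U+DCCC.
text = raw.decode('utf-8', 'surrogateescape')

# On encoding, the same handler turns U+DCCC back into the byte 0xCC.
restored = text.encode('utf-8', 'surrogateescape')

# ascii() is used because printing a lone surrogate directly can itself fail.
print(ascii(text))
print(restored == raw)
```

A likely explanation for the difference, stated here as an assumption rather than a confirmed fact, is that the handler yields a single byte on encoding, while the UTF-16 encoder can only emit whole two-byte code units.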

This may be related to my second question:

Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.

(from https://docs.python.org/3/library/codecs.html#standard-encodings)

As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. So what is the rationale behind this change?
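One way to probe this is to separate two cases: a code point above U+FFFF, which the encoder writes as a surrogate pair, and a lone surrogate code point that is already present in the str. The sketch below assumes Python 3.5 or later (for bytes.hex()); the variable names are illustrative only.

```python
# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs two code units.
emoji = '\U0001F600'
pair = emoji.encode('utf-16-be')
print(pair.hex())  # d83dde00: high surrogate D83D followed by low surrogate DE00

# A lone surrogate code point inside the str is a different situation.
lone = '\ud83d'
try:
    lone.encode('utf-16-be')
    lone_rejected = False
except UnicodeEncodeError:
    lone_rejected = True
print(lone_rejected)

# The 'surrogatepass' handler writes the lone surrogate out as a raw code unit.
passed = lone.encode('utf-16-be', 'surrogatepass')
print(passed.hex())
```

If this reading is right, the 3.4 change restricts surrogate code points in the input string, not the surrogate pairs that the encoder itself produces for characters above U+FFFF.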

