ValueError: Unpaired high surrogate when decoding 'string' while reading a JSON file

2 answers

User

Answer 1 · edited 2024-06-01 07:34:01

The Unicode code point U+D800 may only occur as part of a surrogate pair (and then only in UTF-16 encoding). So the string in the JSON, once decoded, is not valid UTF-8.

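The JSON itself may or may not be valid. The spec says nothing about the case of unmatched surrogates, but it explicitly allows code points that do not exist: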

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.

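Now, you can choose your friends but you can't choose your family, and you can't always choose the data you are handed either. So the next question is: how do we parse this mess?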

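It looks like both the built-in json module in Python (version 3.9) and simplejson (version 3.17.2) parse this JSON without any problem. The problem only shows up when you try to use the string. So this really has nothing to do with JSON: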

>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
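To make the point above concrete, here is a minimal sketch showing that parsing succeeds and only encoding fails. The document in `raw` and the key name `key` are made up for illustration; only the standard-library json module is used.

```python
import json

# A JSON document whose only value is an unpaired high surrogate escape.
raw = '{"key": "\\ud800"}'

# Parsing succeeds: the standard json module keeps the lone surrogate as-is.
data = json.loads(raw)
value = data["key"]
print(len(value), hex(ord(value)))

# The failure only shows up once the string has to be encoded,
# which is what print() does when it writes to a UTF-8 terminal.
try:
    value.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print("encode failed:", exc.reason)
```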

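Fortunately, we can encode the string manually and tell Python how to handle errors. For example, replace the offending code point with a question mark: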

>>> bork.encode('utf-8', errors='replace')
b'?'

The documentation lists the other possible options for the errors parameter.
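As a rough comparison, the sketch below runs the same string through a few of the standard error handlers. The sample string `"a\ud800b"` is an assumption chosen so the effect on neighbouring characters is visible.

```python
# Sample string with a lone high surrogate between two ordinary characters.
bork = "a\ud800b"

# Encode with several built-in error handlers and collect the results.
results = {
    handler: bork.encode("utf-8", errors=handler)
    for handler in ("replace", "ignore", "backslashreplace", "surrogatepass")
}

for handler, encoded in results.items():
    print(f"{handler:17} {encoded!r}")

# replace           b'a?b'
# ignore            b'ab'
# backslashreplace  b'a\\ud800b'
# surrogatepass     b'a\xed\xa0\x80b'
```

Note that the bytes produced by `surrogatepass` are not valid UTF-8; they only decode again if `surrogatepass` is also passed to `decode`.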

To repair the broken string, we can encode it (to bytes) and then decode it again (back to str):

>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
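The same round trip can be applied to a whole parsed document. The helper below is only a sketch: the name `scrub` and the sample data are hypothetical, and it assumes an error handler such as `replace`, `ignore` or `backslashreplace` whose output is valid UTF-8.

```python
import json


def scrub(value, errors="replace"):
    """Return a copy of a parsed JSON value with un-encodable code points handled."""
    if isinstance(value, str):
        # Encode with the chosen handler, then decode back to a clean str.
        return value.encode("utf-8", errors=errors).decode("utf-8")
    if isinstance(value, list):
        return [scrub(item, errors) for item in value]
    if isinstance(value, dict):
        return {scrub(k, errors): scrub(v, errors) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged


data = json.loads('{"name": "ok", "items": ["\\ud800x", 1, null]}')
clean = scrub(data)
print(json.dumps(clean))
```

Walking the structure after parsing keeps the JSON syntax untouched and changes only the string values that cannot be encoded.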

User

Answer 2 · edited 2024-06-01 07:34:01

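An isolated Unicode surrogate does not correspond to anything. Every valid high surrogate code point must be immediately followed by a low surrogate code point before it can be meaningfully decoded.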
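If it helps to locate the offending positions first, something like the following could be used. It is a sketch that assumes the string came from a decoder such as json.loads, which already merges well-formed pairs into a single code point, so any surrogate still present is unpaired. The name `find_surrogates` is made up for this example.

```python
import re

# Any code point in the surrogate range U+D800..U+DFFF.
SURROGATE = re.compile(r"[\ud800-\udfff]")


def find_surrogates(text):
    """Return (index, 'U+XXXX') for every surrogate code point left in a str."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in SURROGATE.finditer(text)]


print(find_surrogates("abc\ud800def"))       # [(3, 'U+D800')]
print(find_surrogates("G clef: \U0001d11e"))  # []
```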

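The error message simply means that this code point has no well-defined meaning. It is like saying "take" without saying what to take, or "look at" without supplying the object of the sentence.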

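You should not use surrogates in files that are not UTF-16; they are strictly reserved for that encoding. They exist to encode characters outside the 16-bit space, which the 16-bit encoding can represent naturally by splitting the character across two code points.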
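The splitting described here is plain arithmetic. The sketch below follows the standard UTF-16 scheme; the function names are hypothetical. It reproduces the G clef example from the spec quote in the first answer.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")                    # D834 DD1E
print(hex(from_surrogate_pair(0xD800, 0xDC00)))   # 0x10000
```

A high surrogate on its own carries only the top 10 bits, which is why it cannot be decoded without the low half.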

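The simple and obvious fix is to supply the missing information, but we don't know what it is. Perhaps you have more context and can fill in the correct low surrogate. But, for example, this works: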

>>> import json
>>> json.loads('{"":"\\ud800\\udc00"}')
{'': '𐀀'}

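This fills the JSON with the single code point U+010000, but of course we have no idea whether that is the code point the data was supposed to contain.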
