Facebook JSON encoding error

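User

Reply #1 · edited 2024-06-26 14:13:26

I can indeed confirm that the Facebook download data is incorrectly encoded; it is Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I'll make sure to file a bug report.

In the meantime, you can repair the damage in two ways:

Decode the data as JSON, then re-encode any strings as Latin-1 and decode them again as UTF-8: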

>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'

Load the data as binary, replace all \u00hh sequences with the byte that the last two hex digits represent, decode as UTF-8, and then decode as JSON:
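
A minimal sketch of this second approach, assuming the export is read as raw bytes and that only escapes for bytes 0x80 and above need repairing. The name repair_mojibake_escapes and the sample bytes are illustrative, not taken from the original post:

```python
import json
import re

# Match only \u0080-\u00ff escapes. Lower values (quotes, control
# characters) are left alone, because turning them into raw bytes
# could produce invalid JSON.
_HIGH_BYTE_ESCAPE = re.compile(rb'\\u00([89a-fA-F][0-9a-fA-F])')


def repair_mojibake_escapes(raw: bytes) -> bytes:
    """Replace each \\u00hh escape with the single byte 0xhh."""
    return _HIGH_BYTE_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode('ascii')), raw)


# Hypothetical sample; a real export would be read with open(path, 'rb').
raw = rb'{"sender_name": "Rados\u00c5\u0082aw"}'
fixed = json.loads(repair_mojibake_escapes(raw).decode('utf-8'))
print(fixed)
```

Restricting the pattern to high bytes is a design choice of this sketch: the pure-ASCII escapes are already valid JSON and are decoded correctly by json.loads itself.
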
For your sample data, this produces:
```
{'content': 'No to trzeba ostatnie treningi zrobić xD',
 'sender_name': 'Radosław',
 'timestamp': 1524558089,
 'type': 'Generic'}
```
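
To see where escapes such as \u00c5\u0082 come from, the corruption can be reproduced in a few lines. This is only an illustration of the UTF-8-read-as-Latin-1 mix-up described at the top of this answer; how Facebook's exporter works internally is an assumption, and the variable names are made up for the example:

```python
import json

original = 'Radosław'

# Encode correctly as UTF-8, then wrongly interpret those bytes as Latin-1.
mojibake = original.encode('utf-8').decode('latin-1')

# Serialising the damaged string yields the escapes seen in the export.
exported = json.dumps(mojibake)
print(exported)

# Reversing the two steps restores the original text.
repaired = json.loads(exported).encode('latin-1').decode('utf-8')
print(repaired)
```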

User

Reply #2 · edited 2024-06-26 14:13:26

Based on @Martijn Pieters's solution, I wrote something similar in Java.

public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}

The unescape method is inspired by org.apache.commons.lang.StringEscapeUtils.


User

Reply #3 · edited 2024-06-26 14:13:26

My solution for parsing objects uses the object_hook callback of the load/loads functions:

import json


def parse_obj(dct):
    # Undo the mojibake for each value; assumes every value is a string.
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
    return dct


data = '{"msg": "Ahoj sv\u00c4\u009bte"}'

# String
json.loads(data)
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}

# File
with open('/path/to/file.json') as f:
    json.load(f, object_hook=parse_obj)
    # Out: {'msg': 'Ahoj světe'}

Update:

The solution above does not work for lists of strings. Here is the updated solution:

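A hook that also covers lists of strings and non-string values could look roughly like the sketch below. It is a reconstruction under assumptions, not the poster's exact code: the helper _fix is a hypothetical name, and the sample data is made up for the example.

```python
import json


def _fix(value):
    """Repair one value: strings are re-decoded, lists are walked."""
    if isinstance(value, str):
        return value.encode('latin_1').decode('utf-8')
    if isinstance(value, list):
        return [_fix(item) for item in value]
    # Numbers, booleans, None and nested dicts are returned unchanged.
    # Nested dicts were already repaired by their own hook call.
    return value


def parse_obj(dct):
    # Called by json.load/json.loads for every decoded JSON object.
    return {key: _fix(value) for key, value in dct.items()}


data = (r'{"msg": "Ahoj sv\u00c4\u009bte", '
        r'"tags": ["Rados\u00c5\u0082aw"], "timestamp": 1524558089}')
result = json.loads(data, object_hook=parse_obj)
print(result)
```

Because object_hook only fires for JSON objects, strings in a top-level array that is not wrapped in an object would stay unrepaired with this sketch.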
