File contains \u00c2\u00a0, convert to characters

{
  "xxxx1": "...You don\u2019t nee...",
  "xxxx2": "...Gu\u00e9rer...",
  "xxxx3": "...boost.\u00a0Sit back an....",
  "xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"",
  "xxxx5": "\u00a0\n\u00a0",
  "xxxx6": "It was Christmas Eve babe\u2026",
  "xxxx7": "It\u2019s xxx xxx\u2026"
}

import json
import re
import codecs


def load():
    epos_export = r'{"xxxx1": "...You don\u2019t nee...","xxxx2": "...Gu\u00e9rer...","xxxx3": "...boost.\u00a0Sit back an....","xxxx4": "\" \u306f\u3058\u3081\u307e\u3057\u3066\"","xxxx5": "\u00a0\n\u00a0","xxxx6": "It was Christmas Eve babe\u2026","xxxx7": "It\u2019s xxx xxx\u2026"}'
    x = json.loads(re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, epos_export))
    with open("TEST.json", "w") as file:
        json.dump(x, file)


def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)


if __name__ == '__main__':
    load()

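3 Answers

User

#1 · edited 2024-06-26 00:08:36

Since you are trying to write this to a file named TEST.json, I will assume that this string is part of a larger JSON string.

Let me use a complete example: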

js = '''{"a": "and voila!\\u00c2\\u00a0At the moment you can't use our"}'''
print(js)

{"a": "and voila!\u00c2\u00a0At the moment you can't use our"}

I first load it with json:

x = json.loads(js)
print(x)

{'a': "and voila!Â\xa0At the moment you can't use our"}

OK, this looks like a UTF-8 string that was wrongly decoded as Latin-1. Let's reverse that:

x['a'] = x['a'].encode('latin1').decode('utf8')
print(x)
print(x['a'])

{'a': "and voila!\xa0At the moment you can't use our"}
and voila! At the moment you can't use our

OK, now it can be converted back to a correct JSON string:

print(json.dumps(x))

{"a": "and voila!\u00a0At the moment you can't use our"}

which represents a correctly encoded non-breaking space (U+00A0).
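As a minimal sketch of how this kind of damage typically arises (an assumption about the export pipeline, which the question does not describe): the UTF-8 bytes of a character are decoded as Latin-1, and the resulting characters are then JSON-escaped. The variable names here are illustrative only.

```python
import json

original = "voila!\u00a0At"              # contains a real NO-BREAK SPACE
utf8_bytes = original.encode("utf8")     # b'voila!\xc2\xa0At'
mangled = utf8_bytes.decode("latin1")    # wrong codec: the two bytes become two characters
escaped = json.dumps(mangled)            # JSON text with \u00c2\u00a0 escapes

# Undoing the wrong decoding step restores the original text.
repaired = json.loads(escaped).encode("latin1").decode("utf8")

print(escaped)
print(repaired == original)
```

Seen this way, `\u00c2\u00a0` is just the two UTF-8 bytes `C2 A0` of U+00A0, each escaped as if it were a character of its own.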

What you should do is:

# load the string as json:
js = json.loads(request)

# identify the string values in the json - you probably know how but I don't...
...

# convert the strings:
js[...] = js[...].encode('latin1').decode('utf8')

# convert back to a json string
request = json.dumps(js)
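For the "identify the string values" step, one possible approach is a recursive walk over the parsed data. This is only a sketch: `fix_mojibake` is a hypothetical helper name, the sample `request` is made up from the strings in the question, and it assumes a string is either entirely mangled or not mangled at all.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings that look like UTF-8 mis-decoded as Latin-1."""
    if isinstance(value, str):
        try:
            return value.encode("latin1").decode("utf8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either the text is not Latin-1 encodable (e.g. Japanese), or the
            # bytes are not valid UTF-8 (e.g. a genuine 'é'): leave it alone.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        # Only values are repaired here; keys are kept as they are.
        return {key: fix_mojibake(item) for key, item in value.items()}
    return value  # numbers, booleans, None


request = r'{"a": "voila!\u00c2\u00a0At", "b": ["Gu\u00e9rer", "\u306f\u3058"]}'
fixed = fix_mojibake(json.loads(request))
request = json.dumps(fixed)
print(request)
```

Strings that fail the round trip are returned unchanged, so correctly escaped values such as `Gu\u00e9rer` survive. A string that mixes mangled and genuinely non-Latin-1 characters would not be repaired by this sketch.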

User

#2 · edited 2024-06-26 00:08:36

I made a crude UTF-8 unmangler that seems to sort out your messed-up encoding situation:

import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'

    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        return escaped                         # keep the original escapes so nothing is lost

Usage:

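broken_json = r'''{"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}'''
print("Broken JSON\n", broken_json)

converted = re.sub(r"(?i)(?:\\u00[0-9a-f]{2})+", unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])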

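It uses a regex to pick the runs of hex escapes out of the string, converts them to individual bytes, and decodes those bytes as UTF-8.

For the sample string above (I included the 3-byte character € as a test), this prints: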

Broken JSON
 {"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
 {"some_key": "... ’ wax, and voila!  At the moment you can't use our € ..."}
Parsed data
 {'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
 ... ’ wax, and voila!  At the moment you can't use our € ...

The \xa0 in "Parsed data" is due to the way Python prints dicts to the console; it is still an actual non-breaking space.
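A quick way to confirm this, as a small sketch with a made-up value: the dict's `repr` escapes the character, but the string itself holds a single U+00A0 code point.

```python
import unicodedata

data = {"some_key": "voila!\u00a0At"}
value = data["some_key"]

shown_in_dict = repr(data)               # the repr escapes the character as \xa0
char_name = unicodedata.name(value[6])   # the character right after "voila!"

print(shown_in_dict)
print(char_name)
```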

User

#3 · edited 2024-06-26 00:08:36

The hacky approach is to strip off the outer layer of encoding:

import re
# Assume export is a bytes-like object
export = re.sub(rb'\\u00([89a-f][0-9a-f])', lambda m: bytes.fromhex(m.group(1).decode()), export, flags=re.IGNORECASE)

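This matches the escaped UTF-8 bytes and replaces them with the actual UTF-8 bytes. Writing the resulting bytes-like object to disk (without further decoding!) should produce a valid UTF-8 JSON file.

Of course, this breaks if the file contains genuine escaped Unicode characters in that range, such as \u00e9 for an accented "e".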
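One way to narrow that caveat, sketched here as a variation rather than as part of the answer above, is to replace only runs of escapes that form a complete multi-byte UTF-8 sequence (a lead byte followed by the right number of continuation bytes). The names `CONT`, `SEQUENCE`, and `unescape_sequence` are hypothetical, and the sample `export` is made up from strings in the question.

```python
import json
import re

# One escaped continuation byte (0x80-0xBF), e.g. \u00a0
CONT = rb'\\u00[89ab][0-9a-f]'
# Escaped lead byte followed by the matching number of continuation bytes.
SEQUENCE = re.compile(
    rb'\\u00(?:c[2-9a-f]|d[0-9a-f])' + CONT          # 2-byte sequence
    + rb'|\\u00e[0-9a-f]' + CONT + CONT              # 3-byte sequence
    + rb'|\\u00f[0-4]' + CONT + CONT + CONT,         # 4-byte sequence
    re.IGNORECASE,
)


def unescape_sequence(match):
    """Turn an escaped byte run into raw bytes if it is valid UTF-8."""
    raw = bytes.fromhex(match.group(0).replace(rb'\u00', b'').decode('ascii'))
    try:
        raw.decode('utf8')       # rejects overlong forms and surrogates
    except UnicodeDecodeError:
        return match.group(0)    # leave the escapes as they were
    return raw


export = rb'{"a": "voila!\u00c2\u00a0At", "b": "Gu\u00e9rer", "c": "\u00e2\u0082\u00ac"}'
result = SEQUENCE.sub(unescape_sequence, export)
print(json.loads(result.decode('utf8')))
```

With this pattern a lone `\u00e9` is left untouched, because it is not followed by escaped continuation bytes. It remains a heuristic: text that genuinely contains a pair such as `\u00c3\u00a9` would still be rewritten.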
