<p>I've made a crude UTF-8 unmangler, which appears to resolve your messed-up encoding situation:</p>
<pre><code>import codecs
import re
import json

def unmangle_utf8(match):
    escaped = match.group(0)                   # '\\u00e2\\u0082\\u00ac'
    hexstr = escaped.replace(r'\u00', '')      # 'e282ac'
    buffer = codecs.decode(hexstr, "hex")      # b'\xe2\x82\xac'
    try:
        return buffer.decode('utf8')           # '€'
    except UnicodeDecodeError:
        print("Could not decode buffer: %s" % buffer)
        # Leave the sequence unchanged if it is not valid UTF-8
        return escaped
</code></pre>
<p>Usage:</p>
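<p>A minimal sketch of how the function might be applied, assuming a pattern that matches runs of <code>\u00XX</code> escapes. The name <code>MANGLED_RUN</code> and the exact pattern are assumptions chosen to reproduce the output shown below, not necessarily the original call:</p>

```python
import codecs
import json
import re


def unmangle_utf8(match):
    """Turn a run of \\u00XX escapes back into the UTF-8 text it encodes."""
    escaped = match.group(0)
    hexstr = escaped.replace(r'\u00', '')
    buffer = codecs.decode(hexstr, "hex")
    try:
        return buffer.decode('utf8')
    except UnicodeDecodeError:
        # Not valid UTF-8: keep the original escapes untouched.
        return escaped


# Assumed pattern: one or more consecutive \u00XX escapes.
MANGLED_RUN = re.compile(r'(?:\\u00[0-9a-fA-F]{2})+')

# Raw string, so the backslashes stay literal as they would in the JSON text.
broken_json = r'''{"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}'''
print("Broken JSON\n", broken_json)

converted = MANGLED_RUN.sub(unmangle_utf8, broken_json)
print("Fixed JSON\n", converted)

data = json.loads(converted)
print("Parsed data\n", data)
print("Single value\n", data['some_key'])
```

<p>Note that this rewrites every <code>\u00XX</code> escape it finds, so an escape such as <code>\u0022</code> would become a bare quote and break the JSON. The sketch does not guard against that.</p>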
<p>It uses a regex to pick out the hex sequences from the string, converts them to individual bytes, and decodes those bytes as UTF-8.</p>
<p>For the sample string above (I've included the 3-byte character <code>€</code> as a test), this prints:</p>
<pre>
Broken JSON
{"some_key": "... \u00e2\u0080\u0099 w\u0061x, and voila!\u00c2\u00a0\u00c2\u00a0At the moment you can't use our \u00e2\u0082\u00ac ..."}
Fixed JSON
{"some_key": "... ’ wax, and voila! At the moment you can't use our € ..."}
Parsed data
{'some_key': "... ’ wax, and voila!\xa0\xa0At the moment you can't use our € ..."}
Single value
... ’ wax, and voila! At the moment you can't use our € ...
</pre>
<p>The <code>\xa0</code> in "Parsed data" is just an artifact of how Python prints a dict to the console; the value still contains the actual non-breaking space.</p>
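<p>A small illustration of that difference, using a made-up value: printing a dict shows the <code>repr()</code> of each string, which escapes the non-breaking space, while printing the string itself writes the character as-is.</p>

```python
# Hypothetical value containing two non-breaking spaces (U+00A0).
value = "voila!\u00a0\u00a0At"
data = {"some_key": value}

dict_text = str(data)   # dicts render their contents with repr()
print(dict_text)        # the non-breaking spaces appear as \xa0 escapes
print(value)            # the actual characters are written out

# The string itself is unchanged: it holds two real U+00A0 characters.
nbsp_count = value.count("\u00a0")
```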