In what world would \\u00c3\\u00a9 become é?

2 answers

User

Answer 1 · edited 2024-10-01 13:44:07

What you have here is Mojibake: UTF-8 data that was decoded from bytes with the wrong codec.
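A minimal sketch of how this kind of Mojibake comes about, written for Python 3 (the sessions in this answer are Python 2). The variable names are illustrative only, and CP1252 as the wrong codec is an assumption that the rest of the answer goes on to test:

```python
# Python 3 sketch: correct UTF-8 bytes decoded with the wrong codec.
original = "d\u00e9cor"                 # the intended text, 'décor'
utf8_bytes = original.encode("utf-8")   # é becomes the two bytes C3 A9
garbled = utf8_bytes.decode("cp1252")   # each byte is read as its own character

# ascii() shows the escapes, which match what ends up in the JSON output.
print(ascii(garbled))
```

The two characters U+00C3 and U+00A9 in the result are exactly the `\u00c3\u00a9` pair from the question.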

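The trick is to work out which encoding was used for that decoding step before the JSON output was produced. If you assume the encoding was Windows codepage 1252, the first two samples can be repaired: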

>>> sample = u'''\
... d\u00c3\u00a9cor
... business\u00e2\u20ac\u2122 active accounts 
... the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label
... '''.splitlines()
>>> print sample[0].encode('cp1252').decode('utf8')
décor
>>> print sample[1].encode('cp1252').decode('utf8')
business’ active accounts
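The same round trip can be wrapped in a small helper. This is a sketch in Python 3, not part of the original session; `undo_cp1252_mojibake` is a hypothetical name, and the helper assumes CP1252 was the wrong codec. It returns the input unchanged when the round trip is not possible:

```python
def undo_cp1252_mojibake(text):
    """Re-encode as CP1252 and decode as UTF-8; return text unchanged on failure."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character has no CP1252 byte, or the bytes are not valid UTF-8.
        return text


samples = [
    "d\u00c3\u00a9cor",
    "business\u00e2\u20ac\u2122 active accounts",
    "caf\u00e9",  # already correct text; must be left alone
]
fixed = [undo_cp1252_mojibake(s) for s in samples]
for line in fixed:
    print(ascii(line))
```

Returning the input on failure keeps already-correct text intact, but it also means a string like the third sample from the question passes through unrepaired.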

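But this codec fails for the third sample:

>>> print sample[2].encode('cp1252').decode('utf8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 24: character maps to <undefined>
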
The first three 'weird' characters are certainly CP1252 Mojibake of the U+201C LEFT DOUBLE QUOTATION MARK codepoint:

>>> sample[2]
u'the \xe2\u20ac\u0153Made in the USA\xe2\u20ac\x9d label'
>>> sample[2][:22].encode('cp1252').decode('utf8')
u'the \u201cMade in the USA'

So the other combination is presumably meant to be U+201D RIGHT DOUBLE QUOTATION MARK, but that character produces a UTF-8 byte that is not normally present in CP1252:

>>> u'\u201d'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>

That is because the CP1252 codec has no character at hex position 9D, yet the codepoint U+009D made it into the JSON output anyway:

>>> sample[2][22:]
u'\xe2\u20ac\x9d label'

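To see which positions Python's CP1252 codec leaves undefined, the bytes in the 0x80 to 0x9F range can be probed one at a time. This is a Python 3 sketch added for illustration, not part of the original session:

```python
# Probe the 0x80-0x9F range, where CP1252 differs from Latin-1.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec has no character assigned to this byte.
        undefined.append(hex(value))

print(undefined)
```

Byte 0x9D, the last byte of the UTF-8 encoding of U+201D, is one of the positions reported.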
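The ftfy library that Ned Batchelder so helpfully alerted me to uses a 'sloppy' CP1252 codec to work around this issue, mapping the non-existent bytes one-to-one (UTF-8 byte to Latin-1 Unicode codepoint). The library then maps the resulting 'fancy quotes' to ASCII quotes, but you can switch that off:

>>> import ftfy
>>> ftfy.fix_text(sample[2])
u'the "Made in the USA" label'
>>> ftfy.fix_text(sample[2], uncurl_quotes=False)
u'the \u201cMade in the USA\u201d label'
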
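Since this library automates the task for you, and does a better job than the standard Python codecs can do here, you should just install it and apply it to the mess this API hands you. Still, if you get half a chance, don't hesitate to berate the people who hand you this data. They have produced one lovely muck-up.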
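For readers who want to see the 'sloppy' idea in isolation, here is a standard-library-only sketch in Python 3. It is not ftfy's implementation; `sloppy_cp1252_encode` is a hypothetical helper that encodes with CP1252 and falls back to Latin-1 for characters CP1252 cannot represent:

```python
def sloppy_cp1252_encode(text):
    """Encode with CP1252, passing unmapped characters below U+0100 through as raw bytes."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Latin-1 maps U+0000..U+00FF one-to-one onto bytes, so U+009D
            # becomes byte 0x9D. Characters above U+00FF still raise here.
            out += ch.encode("latin-1")
    return bytes(out)


garbled = "the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label"
repaired = sloppy_cp1252_encode(garbled).decode("utf-8")
print(ascii(repaired))
```

Unlike ftfy, this sketch does not try to detect whether the text is Mojibake in the first place, so it should only be applied to strings already known to be garbled this way.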

User
Answer 2 · edited 2024-10-01 13:44:07

You should try the ftfy module:

>>> import ftfy
>>> print ftfy.ftfy(u"d\u00c3\u00a9cor")
décor
>>> print ftfy.ftfy(u"business\u00e2\u20ac\u2122 active accounts")
business' active accounts
>>> print ftfy.ftfy(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label")
the "Made in the USA" label
>>> print ftfy.ftfy(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label", uncurl_quotes=False)
the “Made in the USA” label
