How do I decode Unicode characters in a string?

Posted 2024-10-02 04:31:56

I have the following string:

Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.

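This string contains "\u0019t". I can't decode it, because it is already a string. If I encode it first and then decode it, it still shows "\u0019t". How can I get it to display an apostrophe (')?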


2 answers

One option is to evaluate it as a string literal:

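import ast
# raw string, so the \uXXXX sequences stay as literal backslash escapes
s = r"Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction. \u2661"
# wrap the text in double quotes and let Python parse it as a string literal;
# note this fails if s itself contains a double quote
r = ast.literal_eval(f'"{s}"')
print(r)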

Output:

Conversely, companies that arent sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil  well, they usually end up facing extinction. ♡
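The output above looks as if the escapes simply vanished. A minimal sketch, using a shortened sample string of my own rather than the full text, of how to check what they actually decoded to:

```python
import ast
import unicodedata

# Shortened sample (an assumption for brevity) with the same two escapes as the question.
s = r"aren\u0019t plain evil \u0014 well"
decoded = ast.literal_eval(f'"{s}"')

# repr() makes the otherwise invisible characters visible.
print(repr(decoded))
# 'aren\x19t plain evil \x14 well'

# Code point and Unicode general category of every non-printable character.
control_chars = [
    (hex(ord(c)), unicodedata.category(c))
    for c in decoded
    if not c.isprintable()
]
print(control_chars)
# [('0x19', 'Cc'), ('0x14', 'Cc')]
```

Category `Cc` means control character, so the decoding itself worked; the escapes in the source text just point at the wrong code points.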

For some reason, the Unicode escapes in the string are off by 0x2000. The Unicode em dash and apostrophe are:

Unicode Character 'EM DASH' (U+2014)

Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
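The offset can be checked quickly with the standard `unicodedata` module. This is only a sanity check of the 0x2000 claim for the two code points in the question, not a general rule:

```python
import unicodedata

# Add 0x2000 to each code point from the question and look up the result's name.
names = []
for wrong in (0x0019, 0x0014):
    fixed = chr(wrong + 0x2000)
    names.append(unicodedata.name(fixed))
    print(f"U+{wrong:04X} -> U+{ord(fixed):04X} {unicodedata.name(fixed)}")
# U+0019 -> U+2019 RIGHT SINGLE QUOTATION MARK
# U+0014 -> U+2014 EM DASH
```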

So let's fix it anyway, even though the mistake is in the source (theirs) rather than the destination:

import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'

# used to indicate the end of the previous match
# to save the string parts that don't need character encoding
off = 0
# start with an empty string
s = r''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the string
s += text[off:]
print(s)

This prints:

Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.

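Note that I have happily ignored the possibility of \\u00xx in the text (where the backslash itself is escaped); I leave that for you to solve. And of course, any correct Unicode escape in the text will be altered as well.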
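One way to address both caveats is sketched below. It rests on an assumption that is mine, not the answer's: only escapes that decode to a control character (below U+0020) are treated as damaged, and everything else is decoded unchanged. The names `fix_escapes`, `ESCAPE` and `offset` are hypothetical.

```python
import re

# One or more backslashes, then 'u' and exactly four hex digits.
ESCAPE = re.compile(r'(\\+)u([0-9a-fA-F]{4})')

def fix_escapes(text, offset=0x2000):
    """Decode \\uHHHH escapes, shifting only control-range code points."""
    def repl(match):
        slashes = match.group(1)
        code = int(match.group(2), 16)
        # An even number of backslashes means the backslash is itself
        # escaped, so this is not a Unicode escape: leave it untouched.
        if len(slashes) % 2 == 0:
            return match.group(0)
        # ASSUMPTION: only code points below U+0020 are mis-encoded.
        if code < 0x20:
            code += offset
        # Keep any preceding escaped backslashes exactly as they were.
        return slashes[:-1] + chr(code)
    return ESCAPE.sub(repl, text)

sample = r'aren\u0019t evil \u0014 well \u2661 and \\u0019'
print(fix_escapes(sample))
```

With this sample, `\u0019` and `\u0014` are shifted, the valid `\u2661` is decoded as is, and the `\\u0019` with an escaped backslash is left alone. If the source can also contain genuine control-character escapes, the threshold would need a different rule.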
