<p>For some reason, the Unicode escapes in the string are off by 0x2000. The Unicode dash and apostrophe are:</p>
<p><a href="https://www.fileformat.info/info/unicode/char/2014/index.htm" rel="nofollow noreferrer">Unicode Character 'EM DASH' (U+2014)</a></p>
<p>and</p>
<p><a href="https://www.fileformat.info/info/unicode/char/2019/index.htm" rel="nofollow noreferrer">Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)</a></p>
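<p>As a quick sanity check of the offset, the sketch below adds 0x2000 to the two code points that appear in the broken text and looks up the resulting character names. The helper name <code>shifted_name</code> is made up for this illustration.</p>

```python
import unicodedata

def shifted_name(code, offset=0x2000):
    """Return the Unicode name of the character at code + offset."""
    return unicodedata.name(chr(code + offset))

# The two broken escapes seen in the text: \u0014 and \u0019
for broken in (0x0014, 0x0019):
    print(f'U+{broken:04X} -> U+{broken + 0x2000:04X} {shifted_name(broken)}')
```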
<p>So anyway, let's fix it, even though the error is at the source (them) rather than the target:</p>
<pre class="lang-py prettyprint-override"><code>import re
text = r'Conversely, companies that aren\u0019t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil \u0014 well, they usually end up facing extinction.'
pattern = r'\\u([0-9a-fA-F]{4})'
# used to indicate the end of the previous match
# to save the string parts that don't need character encoding
off = 0
# start with an empty string
s = r''
# find and iterate over all matches of \uHHHH where H is a hex digit
for u in re.finditer(pattern, text):
    # append anything up to the unicode escape
    s += text[off:u.start()]
    # fix encoding mistake, unicode escapes are 2000 hex off the mark
    # then append it
    s += chr(int(u.group(1), 16) + 0x2000)
    # set off to the end of the match
    off = u.end()
# append everything from the last match to the end of the line
s += text[off:len(text)]
print(s)
</code></pre>
<p>This prints:</p>
<pre class="lang-none prettyprint-override"><code>Conversely, companies that aren’t sharp-eyed enough to see that their real Dumbwaiter Pitches are lame, tired, or just plain evil — well, they usually end up facing extinction.
</code></pre>
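<p>Note that I have blithely ignored the possibility of <code>\\u00xx</code> in the text (where the backslash itself is escaped); I leave that for you to solve. And of course, any <em>correct</em> Unicode escapes in the text will be changed as well.</p>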
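<p>One possible way to handle both caveats is sketched below. It is not a definitive fix and rests on two assumptions of mine: that an escaped backslash (<code>\\</code>) should be passed through untouched, and that only escapes landing in the control range below U+0020 are broken, so anything at or above that is decoded as written. The function name <code>fix_escapes</code> and the <code>limit</code> parameter are hypothetical; adjust the threshold to match your data.</p>

```python
import re

# Match either an escaped backslash (left alone) or a \uHHHH escape.
# Listing the double backslash first means it is consumed before the
# second alternative can mistake its tail for the start of an escape.
ESCAPE = re.compile(r'\\\\|\\u([0-9a-fA-F]{4})')

def fix_escapes(text, offset=0x2000, limit=0x20):
    """Decode \\uHHHH escapes, shifting suspicious ones by `offset`.

    Assumption: only code points below `limit` (control characters)
    are treated as broken; all others are decoded unchanged.
    """
    def repl(m):
        if m.group(1) is None:
            # escaped backslash: keep it exactly as it was
            return m.group(0)
        code = int(m.group(1), 16)
        if code < limit:
            code += offset
        return chr(code)
    return ESCAPE.sub(repl, text)

sample = r'aren\u0019t \u0014 well, \\u0019 stays, \u00e9 decodes'
print(fix_escapes(sample))
```

<p>Using <code>re.sub</code> with a callback also removes the need to track offsets by hand. The threshold is only a heuristic: a broken escape for a character at U+2020 or above would land at 0x20 or higher and go unfixed.</p>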