How do I decode this UTF-8 string in Python? It was picked up from a random website and saved by the Django ORM.

3 Answers

User

Answer 1 · edited 2024-10-02 20:36:58

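You're fine. You have the correct data. Yes, the original data is UTF-8 (judging by context, u2019 makes perfect sense as the apostrophe between "son" and "s"). The weird ? error character probably just means that the font your terminal is configured with has no glyph for this character (a fancy apostrophe). No big deal. The data is correct where it counts. If you're nervous, try a few different terminal/OS combinations (I'm on OS X using iTerm). I have spent a lot of time explaining to my QA folks that the scary question-mark character only meant they didn't have Chinese fonts installed on their Windows box (in my case, we were testing with Chinese data). Here is some commentary: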

#Create a Python Unicode object
#(abstract code points, independent of any encoding)
#single backslash tells python we want to represent
#a code point by its unicode code point number, typed out with ASCII numbers
>>> s1 = u'his son\u2019s friend'

#If you just type it at the prompt,
#the interpreter does the equivalent of `print repr(s1)`
#and since repr means "show it like a string typed into a python source file",
#you get your ASCII escaped version back
>>> s1
u'his son\u2019s friend'
>>> print repr(s1)
u'his son\u2019s friend'

#This isn't ASCII, so encoding into ASCII generates your original
#error as expected
>>> s1.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

#Encode in UTF-8 and now we have a byte string (a Python 2 str),
#which gets displayed with hex escapes.
#Unicode code point 2019 looks like it gets 3 bytes in UTF-8 (yup, it does)
>>> s1.encode('utf-8')
'his son\xe2\x80\x99s friend'

#My terminal DOES have a different glyph (symbol) to use here,
#so it displays OK for me.
#Note that my terminal has a different glyph for a normal ASCII apostrophe
#(straight vertical)
>>> print s1
his son’s friend
>>> repr(s1)
"u'his son\\u2019s friend'"
>>> str(s1.encode('utf-8'))
'his son\xe2\x80\x99s friend'
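
The session above uses Python 2 syntax (the `u''` prefix and the `print` statement). As a minimal sketch, assuming Python 3, the same checks could look like the block below. The variable names are illustrative only. The last step shows one other way a literal `?` can appear: a lossy conversion with `errors="replace"`, in which case the data really has been altered, unlike the missing-glyph case described above.

```python
import unicodedata

# In Python 3 every str is Unicode, so no u prefix is needed.
s1 = "his son\u2019s friend"

# Look up the official Unicode name of the character at index 7.
name = unicodedata.name(s1[7])

# Encoding produces bytes; decoding those bytes restores the identical text.
encoded = s1.encode("utf-8")
round_tripped = encoded.decode("utf-8")

# With errors="replace", a codec that lacks the character substitutes "?".
ascii_lossy = s1.encode("ascii", errors="replace")

print(name)                  # RIGHT SINGLE QUOTATION MARK
print(encoded)               # b'his son\xe2\x80\x99s friend'
print(round_tripped == s1)   # True
print(ascii_lossy)           # b'his son?s friend'
```

If the round trip compares equal, the stored data is intact and any odd rendering is a display issue.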

See also: http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html

See also character 2019 (e28099 in hex; search for "2019" on this page): http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8000

See also: http://www.joelonsoftware.com/articles/Unicode.html
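
The hex value e28099 quoted for character 2019 can be derived by hand from the 3-byte UTF-8 layout (`1110xxxx 10xxxxxx 10xxxxxx`). The following is a minimal Python 3 sketch of that arithmetic; the function name `utf8_three_byte` is hypothetical and exists only for this illustration, since in practice `str.encode("utf-8")` does the work.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF using the 3-byte UTF-8 layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the 3-byte layout covers U+0800 through U+FFFF only")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    first = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([first, second, third])


manual = utf8_three_byte(0x2019)
print(manual.hex())                        # e28099
print(manual == "\u2019".encode("utf-8"))  # True
```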

User

Answer 2 · edited 2024-10-02 20:36:58

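Maybe I'm being naive, but... isn't your problem simply that the leading \ of the Unicode code point has been escaped?

The original string behaves like this: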

>>> s = u'his son\\u2019s friend'
>>> print(s)
his son\u2019s friend

But removing the escaping \ gives:

>>> s = u'his son\u2019s friend'
>>> print(s)
his son’s friend
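
If the saved value really does contain a literal backslash followed by `u2019` (an assumption; the question does not confirm it), one possible repair is to run the text through the `unicode_escape` codec. This is a minimal Python 3 sketch, and the helper name `unescape_code_points` is hypothetical. It assumes the input is otherwise pure ASCII, because `unicode_escape` treats any other bytes as Latin-1.

```python
def unescape_code_points(text: str) -> str:
    """Turn literal backslash-u sequences into the characters they name.

    Assumes pure-ASCII input: encode("ascii") raises UnicodeEncodeError
    otherwise, which is safer than letting unicode_escape misread the bytes.
    """
    return text.encode("ascii").decode("unicode_escape")


escaped = "his son\\u2019s friend"  # contains a real backslash character
repaired = unescape_code_points(escaped)

print(len(escaped), len(repaired))          # 21 16
print(repaired == "his son\u2019s friend")  # True
```

The length drop from 21 to 16 reflects the six characters of the escape sequence collapsing into a single code point.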

User

Answer 3 · edited 2024-10-02 20:36:58

Try invoking the Python shell like this:

python2 -S -i -c 'import sys;sys.setdefaultencoding("utf-8");import site'

Then:

>>> s = u'his son\u2019s friend'
>>> print s.encode("utf-8")
his son’s friend

The default encoding is then utf-8, and it should print fine.
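
The command above is specific to Python 2. Python 3 has no `sys.setdefaultencoding`, and its default string encoding is always UTF-8, so what matters there is the encoding of the output stream. The block below is a minimal sketch under that assumption: it wraps a binary stream in an `io.TextIOWrapper` with an explicit encoding. The helper name `utf8_text_stream` is hypothetical, and an in-memory buffer stands in for the terminal so the result can be inspected.

```python
import io
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


s = "his son\u2019s friend"

# An in-memory buffer stands in for the terminal here;
# sys.stdout.buffer could be wrapped in the same way.
raw = io.BytesIO()
out = utf8_text_stream(raw)
print(s, file=out)
out.flush()

print(sys.getdefaultencoding())  # utf-8
print(raw.getvalue())            # b'his son\xe2\x80\x99s friend\n'
```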
