Python, unicodedata names and code point values: what am I missing?

#!/bin/env python
# -*- coding: utf-8 -*-

#extracted and licensing from here:
"""
:author: Laurent Pointal <laurent.pointal@limsi.fr> <laurent.pointal@laposte.net>
:organization: CNRS - LIMSI
:copyright: CNRS - 2004-2009
:license: GNU-GPL Version 3 or greater
:version: $Id$
"""

# Chars alonemarks:
# !?¿;,*¤@°:%|¦/()[]{}<>«»´`¨&~=#±£¥$©®"
# must have spaces around them to make them tokens.
# Notes: they may be in pchar or fchar too, to identify punctuation after
# a fchar.
# \202 is a special ,
# \226 \227 are special -
alonemarks = u"!?¿;,\202*¤@°:%|¦/()[\]{}<>«»´`¨&~=#±\226"+\
             u"\227£¥$©®\""

import unicodedata
for x in alonemarks:
    unicodename = unicodedata.name(x, '<unknown>')
    print "\t".join(map(unicode, (x, len(x), ord(x), unicodename, unicodedata.category(x))))

# unichr(int('fd9b', 16)).encode('utf-8')
# http://stackoverflow.com/questions/867866/convert-unicode-codepoint-to-utf8-hex-in-python

2 Answers

Answer 1 · edited 2024-09-28 23:21:54

According to the unicodedata library documentation:

The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see here)

Your two characters produce the following output:

1   150 <unknown>   Cc
1   151 <unknown>   Cc

They correspond to the control characters 0x96 and 0x97. The Unicode documentation referenced above states, in its code point paragraph:

Surrogate code points, private-use characters, control codes, noncharacters, and unassigned code points have no names.

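I don't know how to obtain, through the unicodedata module, the label that corresponds to the Unicode comment, but I don't think you can get any name for your two control characters, because that is how the Unicode specification defines them.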
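The behaviour described above can be reproduced with a short sketch. It is written for Python 3 (an assumption, since the question's code is Python 2), and the helper name `describe` is made up for this illustration. It relies only on `unicodedata.name` with a default value and `unicodedata.category`.

```python
import unicodedata

def describe(ch):
    """Return (code point, name or '<unknown>', general category) for one character.

    Without the second argument, unicodedata.name() raises ValueError
    for characters that have no name, such as control characters.
    """
    return (ord(ch), unicodedata.name(ch, "<unknown>"), unicodedata.category(ch))

# The three escaped characters from the question: octal \202, \226 and \227.
for ch in "\202\226\227":
    print("%d\t%s\t%s" % describe(ch))

# A character that does have a name, for comparison.
print("%d\t%s\t%s" % describe("\u00bf"))
```

All three escaped characters fall in the C1 control range, so they come back with the `Cc` category and the fallback name, while `¿` reports its real name.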

Answer 2 · edited 2024-09-28 23:21:54

I thought that every character inside the Unicode database was named

No, control characters have no names; see the UnicodeData file.
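One possible explanation for the unnamed characters, offered here only as an assumption and not confirmed anywhere in the thread: the question's comments describe `\202` as a special comma and `\226`, `\227` as special dashes, which matches the meaning of bytes 0x82, 0x96 and 0x97 in the Windows-1252 encoding. If that is what was intended, decoding those bytes as `cp1252` gives characters that do have names. A minimal Python 3 sketch:

```python
import unicodedata

# Assumption: the octal escapes were meant as Windows-1252 bytes,
# not as Unicode code points U+0082, U+0096 and U+0097.
raw = b"\202\226\227"
decoded = raw.decode("cp1252")

for ch in decoded:
    # Print the code point, the Unicode name, and the general category.
    print("U+%04X\t%s\t%s" % (ord(ch), unicodedata.name(ch), unicodedata.category(ch)))
```

Under that assumption the three characters are U+201A, U+2013 and U+2014, which are punctuation rather than control codes.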

Another question: how can I get the Unicode code point value? Is it ord(unicodechar)?

Yes!

import unicodedata
print('%x' % ord(unicodedata.lookup('LATIN LETTER SMALL CAPITAL Z')))
## 1d22
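As a slightly fuller sketch of the same idea, the snippet below goes from a name to a character, to its code point, and back again, and also shows the UTF-8 bytes mentioned in the comment at the end of the question. It assumes Python 3, where `chr()` takes the place of Python 2's `unichr()`; the variable names are chosen only for this illustration.

```python
import unicodedata

# Name -> character, using the Unicode character database.
ch = unicodedata.lookup("LATIN LETTER SMALL CAPITAL Z")

code_point = ord(ch)                 # character -> integer code point
hex_form = "%x" % code_point         # same formatting as the answer above
round_trip = chr(code_point)         # code point -> character again
utf8_hex = ch.encode("utf-8").hex()  # UTF-8 bytes of the character, as hex

print(code_point, hex_form, round_trip == ch, utf8_hex)
```

`ord()` and `chr()` are inverses of each other, so the round trip always returns the original character.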
