Converting Unicode code point escapes into Unicode characters

def ParseString2Unicode(sInString):
    r"""Return a version of sInString in which any Unicode code points
    of the form \uXXXX (X = hex digit) have been converted into their
    corresponding Unicode characters.

    Example: "\u0064b\u0065" becomes "dbe"
    """
    sOutString = ""
    while sInString:
        if len(sInString) >= 6 and \
           sInString[0] == "\\" and \
           sInString[1] == "u" and \
           sInString[2] in "0123456789ABCDEF" and \
           sInString[3] in "0123456789ABCDEF" and \
           sInString[4] in "0123456789ABCDEF" and \
           sInString[5] in "0123456789ABCDEF":
            # If we get here, the first 6 characters of sInString represent
            # a Unicode code point, like "\u0065"; convert it into a char:
            sOutString += chr(int(sInString[2:6], 16))
            sInString = sInString[6:]
        else:
            # Strip a single char:
            sOutString += sInString[0]
            sInString = sInString[1:]
    return sOutString

2 answers

User

#1 · edited 2024-09-30 14:37:34

A concise and flexible way to handle this is with a regular expression:

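import re

def ParseString2Unicode(sInString):
    # Replace each \uXXXX escape with the character it names.
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m[1], 16)),
        sInString,
    )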

User

#2 · edited 2024-09-30 14:37:34

You may want to look at the raw_unicode_escape encoding:

>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1

So the function is:

^{pr2}$
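
The snippet that belonged here did not survive; only a placeholder remains. As a rough sketch of what a function built on raw_unicode_escape might look like (not necessarily what the answer originally showed), the version below round-trips the string through that codec. The name `decode_with_raw_unicode_escape` is made up for illustration, and the sketch assumes every backslash escape in the input is well formed, since a truncated one makes `decode` raise `UnicodeDecodeError`.

```python
def decode_with_raw_unicode_escape(text):
    """Decode \\uXXXX and \\UXXXXXXXX escapes in text (hypothetical helper).

    Encoding with raw_unicode_escape leaves literal backslashes alone and
    turns characters above U+00FF into escapes; decoding with the same
    codec then converts every escape back into a character.
    """
    return text.encode("raw_unicode_escape").decode("raw_unicode_escape")


print(decode_with_raw_unicode_escape(r"\u0064b\u0065"))  # dbe
```

Because non-Latin-1 characters are escaped on the way in and restored on the way out, text that already contains such characters passes through unchanged.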

However, this also matches other Unicode escape sequences, such as \Uxxxxxxxx. If you only want to match \uxxxx, use a regular expression like this:

import re

escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')

def _escape_sequence_to_char(match):
    return chr(int(match[0][2:], 16))

def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
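
To make the difference from the codec approach concrete, here is a small self-contained check of the regex technique. The name `replace_u_escapes` is illustrative, not from the answers above; the pattern is the same four-hex-digit one.

```python
import re

# Same pattern as in the answer: a backslash, a lowercase "u", four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def replace_u_escapes(text):
    """Replace each \\uXXXX escape in text with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(replace_u_escapes(r"\u0064b\u0065"))  # dbe
print(replace_u_escapes(r"\U0001F600"))     # left as-is: \U0001F600
print(replace_u_escapes(r"\u00zz"))         # left as-is: \u00zz
```

Sequences that do not fit the pattern exactly, including the eight-digit \U form and malformed escapes, are passed through untouched rather than raising an error.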
