如何转变&#xxx；字符的正常表示形式？

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <xsl:output method="xml" indent="yes"/> <xsl:template match="node()|@*"> <xsl:copy> <xsl:apply-templates select="node()|@*"/> </xsl:copy> </xsl:template> </xsl:stylesheet>

2条回答

网友
1楼 · 编辑于 2024-09-10 15:10:21

基本上,#1088这里1088是Unicode代码点。在python中，可以通过chr(integer value of Unicode code point)将Unicode代码点转换为实际表示形式
前面的b'<root...表示它是binary。所以我们需要使用.decode()将其转换为string
最后，我们可以使用regular expression通过- &#(\d{4});
&#：将匹配以&#
()：捕获组
\d{4}：选择长度为4的数字
;：以;
结尾
import re a = b'<root>Здраве</root>\n' ''.join([chr(int(i)) for i in re.findall(r'&#(\d{4});', a.decode())])
Здраве

网友
2楼 · 编辑于 2024-09-10 15:10:21

您可以使用html模块：
html.unescape('<root>Здраве</root>\n') '<root>Здраве</root>\n'
如果您正在接收字节，则需要首先将它们转换为字符串：
b = b'<root>Здраве</root>\n' html.unescape(b.decode('utf-8')) '<root>Здраве</root>\n'
您也可以尝试在对ET.tostring的调用中使用encoding='unicode'。它应该直接返回Python字符串，因为Python在内部对字符串使用unicode：
print(ET.tostring(newdom, encoding='unicode', pretty_print=True))

相关问题更多 >

编程相关推荐

热门问题

热门文章