Unicode characters disappearing with html.parser

Posted 2024-04-26 04:36:18

I am extracting HTML from some web pages that contain Unicode characters, like this:

import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py"""
    # Send a browser-like User-Agent header with the request.
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    # Close the connection once the body has been read.
    with urllib.request.urlopen(request) as response:
        # Assumes the page is UTF-8 encoded.
        return response.read().decode("utf-8")

As you can see, I am decoding the response properly, so html is now a Unicode string. When I print html, I can see the Unicode characters.

I am parsing the HTML with html.parser, which I have subclassed:

[code block not preserved in the source]

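When the HTML is parsed through the class's handle_data, the Unicode characters seem to be stripped out; they simply vanish. The documentation says nothing about encoding. Why does the HTML parser drop non-ASCII characters, and how can I fix this?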
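The subclass from the question is not available here, so the following is only a minimal sketch of the symptom, assuming a hypothetical subclass (DataOnly) that overrides nothing but handle_data. It suggests that literal non-ASCII characters do reach handle_data, while characters written as references such as &eacute; are skipped when convert_charrefs is False (the only behavior before Python 3.4, and the default in 3.4). The sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class DataOnly(HTMLParser):
    """Hypothetical stand-in for the missing subclass: handle_data only."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        # Collect every piece of text the parser hands over.
        self.chunks.append(data)


def collect(markup, convert_charrefs):
    """Parse markup and return all text seen by handle_data."""
    parser = DataOnly(convert_charrefs)
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = "<p>caf&eacute; ü</p>"  # made-up input: one reference, one literal
print(collect(sample, convert_charrefs=False))  # caf ü
print(collect(sample, convert_charrefs=True))   # café ü
```

On Python 3.5 and later, convert_charrefs defaults to True, so a subclass like this one would receive the converted characters without extra work.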

Tags: url, web page, response, request, windows, def, html, unicode

1 Answer

User

#1 · Posted 2024-04-26 04:36:18

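Apparently html.parser calls handle_entityref when it encounters a non-ASCII character written as a named character reference (for example &eacute;). It passes the name of the reference, and to convert that into a Unicode character I used: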

html.entities.html5[name]
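One caveat with the lookup above: many keys in html.entities.html5 carry a trailing semicolon (for example "hellip;"), while handle_entityref receives the bare name, so a direct lookup may raise KeyError for some names. A minimal sketch of one way around that, assuming a hypothetical TextCollector subclass and using html.unescape to resolve both named and numeric references:

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that keeps character references as text."""

    def __init__(self):
        # convert_charrefs=False makes the parser call the two
        # reference handlers below instead of converting on its own.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name arrives without "&" and ";", e.g. "eacute".
        # Unknown names come back from html.unescape unchanged.
        self.chunks.append(html.unescape("&" + name + ";"))

    def handle_charref(self, name):
        # name is "233" for &#233; or "xE9" for &#xE9;.
        self.chunks.append(html.unescape("&#" + name + ";"))


def extract_text(markup):
    """Return the text content of markup with references resolved."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

For example, extract_text("<p>caf&eacute; &hellip;</p>") should give "café …". On Python 3.5 and later the same result is available without custom handlers, because convert_charrefs defaults to True and the converted text is passed to handle_data.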

Python's documentation does not mention this. I have never seen documentation worse than Python's.
