HTML scraping with lxml and requests gives a unicod

from lxml import html
import requests
page = requests.get('http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=PTEN&ln1=PTEN&start=130&end=140&coords=bp%3AAA&sn=&ss=&hn=&sh=&id=15#')
tree = html.fromstring(page.text)

1 answer

User

Answer 1 · Posted 2024-09-28 20:59:16

In short: use page.content, not page.text.

From http://lxml.de/parsing.html#python-unicode-strings:

the parsers in lxml.etree can handle unicode strings straight away ... This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding

From http://docs.python-requests.org/en/latest/user/quickstart/#response-content:

Requests will automatically decode content from the server [as r.text]. ... You can also access the response body as bytes [as r.content].

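So you see, both requests (through page.text) and lxml.etree want to decode the UTF-8 bytes into unicode. But if page.text does the decoding, the encoding declaration inside the document becomes a lie: lxml is handed an already-decoded string that still claims to be encoded.

So pass page.content, which requests leaves undecoded. That way lxml receives consistent raw bytes and decodes them itself according to the document's own declaration.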
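
The fix can be sketched roughly as follows. This is a minimal illustration, not the asker's full script: the helper names parse_html_bytes and fetch_tree and the 30-second timeout are assumptions made for the example, while requests.get, page.content, and html.fromstring are the calls already discussed above.

```python
from lxml import html
import requests


def parse_html_bytes(body: bytes) -> html.HtmlElement:
    """Parse raw, undecoded bytes so that lxml resolves the encoding itself.

    Hypothetical helper for illustration only.
    """
    return html.fromstring(body)


def fetch_tree(url: str) -> html.HtmlElement:
    """Download a page and parse its undecoded body (hypothetical helper)."""
    page = requests.get(url, timeout=30)  # timeout value is an assumption
    page.raise_for_status()
    # page.content is bytes; page.text would already be decoded to str,
    # which conflicts with any encoding declaration left in the markup.
    return parse_html_bytes(page.content)
```

Calling fetch_tree with the URL from the question should then return the parsed tree, on which tree.xpath(...) can be used as usual.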
