Python encoding ValueError message

Posted 2024-09-28 03:22:02


I am running the following code for a web scraper:

    import os
    import urllib  # Python 2 module; urllib.urlopen is used below

    import requests
    from lxml import html

    # save source page and return xpath tree
    def scrape_Page(url, path):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        # save html content
        file_name = url.split('/')[-1] + ".html"
        with open(os.path.join(path, file_name), 'wb') as srcFile:
            webPage = urllib.urlopen(url)
            wPageSrc = webPage.read()
            webPage.close()
            # write to text file
            srcFile.write(wPageSrc)
        return tree

This code works fine for some URLs but fails for a few others. Here is the error message I get:

^{pr2}$

1 answer
User
#1 · Posted 2024-09-28 03:22:02

tl;dr: Use html.fromstring(r.content), which is page.content in your code.

For details, see the lxml documentation under Python unicode strings:

… the parsers in lxml.etree can handle unicode strings straight away … This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding … Similarly, you will get errors when you try the same with HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
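The restriction quoted above can be tried out without any network access. The following is a minimal sketch, assuming lxml is installed; the sample document RAW and the helper names parse_title and try_parse are made up for illustration. On the lxml versions I know of, the already-decoded string that still carries an encoding declaration is rejected with a ValueError, while the same data as bytes parses fine.

```python
from lxml import html

# Hypothetical sample page that declares its own encoding (ASCII-only content).
RAW = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<html><head><title>Example</title></head>"
    b"<body><p>hello</p></body></html>"
)


def parse_title(document):
    """Parse a document (bytes or str) and return the text of its <title>."""
    tree = html.fromstring(document)
    return tree.findtext(".//title")


def try_parse(document):
    """Return (title, None) on success or (None, message) on ValueError."""
    try:
        return parse_title(document), None
    except ValueError as exc:
        return None, str(exc)


if __name__ == "__main__":
    # Raw bytes: accepted by the parser.
    print(try_parse(RAW))
    # Decoded str that still contains the encoding declaration:
    # expected to come back as (None, <error message>).
    print(try_parse(RAW.decode("utf-8")))
```

If the second call does return an error message on your installation, it should be the same kind of message the question is about.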

Meanwhile, if you look at the requests documentation, under Response Content:

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded … When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text.
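To see what that header-based guess looks like, here is a small sketch, assuming requests is installed. As far as I know, requests.utils.get_encoding_from_headers is the helper behind the header part of the guess; the header dictionaries below are invented for illustration.

```python
from requests.utils import get_encoding_from_headers

# Hypothetical response headers, for illustration only.
with_charset = {"content-type": "text/html; charset=utf-8"}
without_charset = {"content-type": "text/html"}
no_content_type = {}

# An explicit charset in the header is used as-is.
print(get_encoding_from_headers(with_charset))      # utf-8
# A text/* type without a charset falls back to ISO-8859-1.
print(get_encoding_from_headers(without_charset))   # ISO-8859-1
# With no Content-Type header there is nothing to go on.
print(get_encoding_from_headers(no_content_type))   # None
```

The middle case is the troublesome one: the server says nothing about the charset, so the header-based guess can differ from what the page declares about itself.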

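So, putting these together: never call html.fromstring(page.text), because page.text has already been decoded to Unicode, and lxml's parser does not want Unicode. What lxml wants is the raw, undecoded bytes.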

How do you get the raw, undecoded bytes out of requests? Look at the next section of the requests documentation, Binary Response Content:

You can also access the response body as bytes, for non-text requests … r.content
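Applied to the code in the question, the change could look roughly like this. It is a sketch rather than a drop-in replacement: the split into parse_and_save and scrape_page is my own, and it also reuses the bytes from requests for the saved file instead of downloading the page a second time with urllib.

```python
import os

import requests
from lxml import html


def parse_and_save(raw, url, path):
    """Save raw response bytes to disk and return the parsed lxml tree."""
    # Derive the file name from the last URL segment, as in the question.
    file_name = url.split("/")[-1] + ".html"
    with open(os.path.join(path, file_name), "wb") as src_file:
        src_file.write(raw)
    # Hand lxml the undecoded bytes so it can work out the encoding itself.
    return html.fromstring(raw)


def scrape_page(url, path):
    """Download url once, save its source, and return the xpath tree."""
    page = requests.get(url)
    # page.content is bytes; page.text would be an already-decoded str.
    return parse_and_save(page.content, url, path)
```

Keeping the download separate from the parsing also makes the parsing step easy to try on bytes you already have on disk.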

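If you don't understand the difference between Unicode strings and byte strings, or what all this decoding nonsense is about, the Unicode HOWTO in the Python docs has a good explanation. But basically: network sockets (and files, and many other things) only deal in bytes, which means each unit can hold only 256 different values, yet there are far more than 256 characters. How do you deal with that? You pick an encoding, use it to turn the Unicode text into a sequence of bytes, send those over the network, and decode them on the other end. That means you need some way to specify which encoding you picked, so the other side can decode it.

Web pages usually specify the encoding in their headers, although there are some other ways to do it. requests tries to be smart: it digs that information out for you and takes care of the decoding so you don't have to think about it, which is usually very cool. Unfortunately, lxml also tries to be smart and work out the decoding for you, and if both of them do it, they end up confusing each other.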
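The encode-send-decode cycle described above can be shown with nothing but the standard library. This is a small sketch; the helper name roundtrip and the sample word are made up for illustration.

```python
def roundtrip(text, send_encoding, receive_encoding):
    """Encode text for the wire, then decode it as the receiver would."""
    wire_bytes = text.encode(send_encoding)
    return wire_bytes, wire_bytes.decode(receive_encoding)


if __name__ == "__main__":
    # Both sides agree on UTF-8: the text survives the trip.
    print(roundtrip("café", "utf-8", "utf-8"))    # (b'caf\xc3\xa9', 'café')
    # The receiver guesses Latin-1: no error, just the wrong characters.
    print(roundtrip("café", "utf-8", "latin-1"))  # (b'caf\xc3\xa9', 'cafÃ©')
    # The receiver guesses UTF-8 for Latin-1 bytes: decoding fails outright.
    try:
        roundtrip("café", "latin-1", "utf-8")
    except UnicodeDecodeError as exc:
        print("decode failed:", exc)
```

The second case is the nasty one, since nothing fails and the text is simply wrong. That is the kind of trouble you invite when two layers each apply their own guess about the encoding.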
