
Posted 2024-09-28 03:22:02


I am running the following code for a web scraper:

    import os
    import urllib.request

    import requests
    from lxml import html

    # save source page and return xpath tree
    def scrape_Page(url, path):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        # save html content
        file_name = url.split('/')[-1] + ".html"
        with open(os.path.join(path, file_name), 'wb') as srcFile:
            # second download of the same URL, read as raw bytes
            webPage = urllib.request.urlopen(url)
            wPageSrc = webPage.read()
            webPage.close()
            # write the raw bytes to the file
            srcFile.write(wPageSrc)
        return tree



Answer 1 · posted 2024-09-28 03:22:02


For details, see the lxml documentation under Python unicode strings:

… the parsers in lxml.etree can handle unicode strings straight away … This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding … Similarly, you will get errors when you try the same with HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

Meanwhile, if you look at the requests documentation, under Response Content:

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded … When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text.


So how do you get the raw, undecoded bytes from requests? See the next section of the requests documentation, Binary Response Content:

You can also access the response body as bytes, for non-text requests … r.content
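As a minimal sketch of what this suggests for the question's code: keep the bytes from r.content and write them to disk unchanged, instead of downloading the page a second time. The helper name save_page_source is hypothetical, and the demo uses stand-in bytes and a made-up example.com URL rather than a real response.

```python
import os
import tempfile


def save_page_source(body: bytes, url: str, path: str) -> str:
    """Write already-downloaded response bytes to <path>/<last URL segment>.html."""
    # Same naming rule as the question: last URL segment plus ".html".
    file_name = url.split("/")[-1] + ".html"
    target = os.path.join(path, file_name)
    # Binary mode: the bytes reach the disk exactly as they were received.
    with open(target, "wb") as src_file:
        src_file.write(body)
    return target


# Demo with stand-in bytes (an assumption; a real run would use r.content).
body = "<html><body>caf\u00e9</body></html>".encode("utf-8")
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_page_source(body, "http://example.com/docs/page", tmp_dir)
    saved_name = os.path.basename(saved)
    with open(saved, "rb") as f:
        round_trip = f.read()
```

With a real response page = requests.get(url), the same page.content bytes would presumably go both to this helper and to html.fromstring, so that lxml does the decoding itself.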

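If you don't understand the difference between Unicode strings and byte strings, and what all this decoding business is about, the Unicode HOWTO in the Python documentation has a good explanation. But basically: network sockets (and files, and many other things) deal only in bytes, so each unit can take only 256 distinct values, whereas Unicode defines well over 100,000 characters. How do you deal with that? You pick an encoding, use it to turn the Unicode text into a sequence of bytes, send that over the network, and decode it at the other end. This means you need some way to state which encoding you picked, so that the other side can decode it. Web pages usually specify it in their headers, although there are a few other ways to do it. requests tries to be clever: it digs out that information for you and handles the decoding so that you don't have to think about it, which is usually very convenient. Unfortunately, lxml also tries to be clever and work out the decoding for you, and when both of them do it, they get in each other's way. In the question's code, that means passing page.content rather than page.text to html.fromstring.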
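The encode/decode round trip described above can be illustrated with the standard library alone. This is only a sketch: the sample word and the choice of UTF-8 and Latin-1 are assumptions made for the illustration, not something taken from the question.

```python
# Sample text containing one non-ASCII character (e with an acute accent).
text = "caf\u00e9"

# Encoding turns the 4 characters into 5 bytes: the accented letter needs two.
data = text.encode("utf-8")

# Decoding with the encoding that was actually used restores the text.
right = data.decode("utf-8")

# Decoding with a different encoding does not fail here, but yields garbage.
wrong = data.decode("latin-1")

print(len(text), len(data))
print(right == text)
print(wrong == text)
```

The wrong decode raises no error and simply produces different characters, which is why the receiving side needs to be told which encoding the sender chose.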

