使用正确的字符编码进行刮取（python请求+beautifulsoup）

2条回答

网友

1楼 · 编辑于 2024-09-28 22:14:47

我刚找到两个解决办法。你能确认一下吗？在

Soup = BeautifulSoup(r.content.decode('utf-8','ignore'),"lxml")

以及

^{pr2}$

这两个结果都会产生以下示例输出：

Der Wildlöwenpfleger

编辑： 我只是想知道为什么这些工作，因为r.encoding结果是{}。这说明请求无论如何都将数据处理为UTF-8数据。因此，我想知道为什么.decode('utf-8','ignore')或{}会产生所需的输出？在

编辑2: 好吧，我想我现在明白了。.decode('utf-8','ignore')和fromEncoding='utf-8'意味着实际数据被编码为UTF-8，而beautifulGroup应该对其进行解析，并将其处理为UTF-8编码的数据，实际上就是这样。在

我假设requests正确地将其处理为UTF-8，但{}没有。因此，我必须做这个额外的解码。在

网友

2楼 · 编辑于 2024-09-28 22:14:47

一般来说，不要使用r.content这是接收到的字节串，而是使用r.text，它是使用requests确定的编码的解码内容。在

在这种情况下，requests将使用UTF-8对传入的字节字符串进行解码，因为这是服务器在Content-Type报头中报告的编码：

import requests

r = requests.get('http://fm4-archiv.at/files.php?cat=106')

>>> type(r.content)    # raw content
<class 'bytes'>
>>> type(r.text)       # decoded to unicode
<class 'str'>    
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'

>>> soup = BeautifulSoup(r.text, 'lxml')

这将解决“Wildlöwenpfleger”问题，然而，页面的其他部分随后开始中断，例如：

^{pr2}$

显示“Wildlöwenpfleger”是固定的，但现在“übergebergen”和其他第二个链接被破坏。在

在一个HTML文档中似乎使用了多个编码。第一个链接使用UTF-8编码：

>>> r.content[8013:8070].decode('iso-8859-1')
'<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'

>>> r.content[8013:8070].decode('utf8')
'<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'

但第二个链接使用ISO-8859-1编码：

>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

显然，在同一个HTML文档中使用多个编码是不正确的。除了联系文档的作者并要求更正之外，处理混合编码没有太多的事情可以轻松完成。也许您可以在处理数据时对其运行^{}，但这并不令人愉快。在

相关问题更多 >

编程相关推荐

热门问题

热门文章