Scraping with the correct character encoding (Python requests + BeautifulSoup)


1 Answer

Anonymous, 1 day ago

In general, do not use <code>r.content</code>, which is the byte string as received. Use <code>r.text</code> instead, which is the content decoded with the encoding that <code>requests</code> determined. In this case, <code>requests</code> decodes the incoming byte string as UTF-8, because that is the encoding the server reports in its <code>Content-Type</code> header:

<pre><code>>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://fm4-archiv.at/files.php?cat=106')
>>> type(r.content)    # raw bytes as received
<class 'bytes'>
>>> type(r.text)       # decoded to a unicode string
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> soup = BeautifulSoup(r.text, 'lxml')
</code></pre>

This fixes the “Wildlöwenpfleger” problem, but other parts of the page then start to break: “Wildlöwenpfleger” is now correct, while “übergeben” and other words in a second link come out garbled.

It appears that more than one encoding is used within the same HTML document. The first link is encoded as UTF-8:

<pre><code>>>> r.content[8013:8070].decode('iso-8859-1')
'<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'
>>> r.content[8013:8070].decode('utf8')
'<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'
</code></pre>

But the second link is encoded as ISO-8859-1:

<pre><code>>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'
>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'
</code></pre>

Using multiple encodings in the same HTML document is clearly incorrect. Apart from contacting the document's author and asking for it to be fixed, there is not much you can easily do to handle the mixed encoding. You could perhaps run <a href="https://pypi.python.org/pypi/chardet" rel="nofollow noreferrer"><code>chardet</code></a> over the data as you process it, but that is not pleasant.
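If contacting the author is not an option, one possible workaround for the mixed encoding described above is to decode the raw bytes as UTF-8 and fall back to ISO-8859-1 only for the byte sequences that are not valid UTF-8. The following is a minimal sketch of that idea, not part of the original answer; the handler name <code>latin1_fallback</code> and the helper <code>decode_mixed</code> are hypothetical names chosen for illustration. It is a heuristic: ISO-8859-1 text that happens to form valid UTF-8 would still be misread.

```python
import codecs


def latin1_fallback(err):
    """Error handler: decode the bytes UTF-8 rejected as ISO-8859-1 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad_bytes.decode('iso-8859-1'), err.end


# 'latin1_fallback' is a made-up handler name registered for this sketch.
codecs.register_error('latin1_fallback', latin1_fallback)


def decode_mixed(raw: bytes) -> str:
    """Decode bytes that are mostly UTF-8 but contain ISO-8859-1 fragments."""
    return raw.decode('utf-8', errors='latin1_fallback')


if __name__ == '__main__':
    # Simulated page fragment: first part UTF-8, second part ISO-8859-1.
    sample = ('Der Wildlöwenpfleger'.encode('utf-8')
              + b' / '
              + 'Salon übergeben'.encode('iso-8859-1'))
    print(decode_mixed(sample))
```

With a real response, the decoded string would be passed to the parser in place of <code>r.text</code>, for example <code>BeautifulSoup(decode_mixed(r.content), 'lxml')</code>.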
