<p>Requests sets the <code>response.encoding</code> attribute to <code>ISO-8859-1</code> when you have a <code>text/*</code> response and no charset has been specified in the response headers.</p>
<p>See the <a href="http://docs.python-requests.org/en/latest/user/advanced/#encodings" rel="nofollow noreferrer"><em>Encoding</em> section of the <em>Advanced</em> documentation</a>:</p>
<blockquote>
<p>The only time Requests will not do this is if no explicit charset is present in the HTTP headers <em>and</em> the <code>Content-Type</code> header contains <code>text</code>. <strong>In this situation, RFC 2616 specifies that the default charset must be <code>ISO-8859-1</code></strong>. Requests follows the specification in this case. If you require a different encoding, you can manually set the <code>Response.encoding</code> property, or use the raw <code>Response.content</code>.</p>
</blockquote>
<p>Bold emphasis mine.</p>
<p>You can test for this situation by looking for a <code>charset</code> parameter in the <code>Content-Type</code> header:</p>
<pre><code>import requests

resp = requests.get(....)
# only trust resp.encoding when the server actually sent a charset parameter
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
</code></pre>
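<p>The substring test above is usually enough. If you also want the declared value, a minimal sketch using only the standard library could look like the following; the helper name <code>declared_charset</code> is hypothetical and not part of Requests.</p>

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param() parses the header parameters (unquoting values) and
    # returns None when the parameter is absent.
    charset = msg.get_param("charset")
    # RFC 2231 encoded parameters come back as a tuple; this sketch
    # ignores that rare case and treats it as "not declared".
    return charset.lower() if isinstance(charset, str) else None


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

<p>With a live response you would pass <code>resp.headers.get('content-type', '')</code> to the helper and fall back to the document's own declaration when it returns <code>None</code>.</p>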
<p>Your HTML document specifies the content type in a <code>&lt;meta&gt;</code> header, and it is this header that is authoritative:</p>
^{pr2}$
<p>HTML5 also defines a <code>&lt;meta charset="..." /&gt;</code> tag, see <a href="https://stackoverflow.com/questions/4696499/meta-charset-utf-8-vs-meta-http-equiv-content-type">&lt;meta charset="utf-8"&gt; vs &lt;meta http-equiv="Content-Type"&gt;</a></p>
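<p>To see which encoding a page declares through either form of the tag, something along these lines may help. It is a rough sketch built on the standard library's <code>html.parser</code>; the names <code>MetaCharsetFinder</code> and <code>meta_charset</code> are made up for illustration, and the lenient ASCII decoding is an assumption that only holds for ASCII-compatible encodings.</p>

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
            msg = Message()
            msg["Content-Type"] = attrs.get("content") or ""
            value = msg.get_param("charset")
            if isinstance(value, str):
                self.charset = value.lower()


def meta_charset(html_bytes):
    """Return the charset named in the document's <meta> tags, or None."""
    finder = MetaCharsetFinder()
    # Charset names are ASCII, so decode leniently just to find the declaration.
    finder.feed(html_bytes.decode("ascii", errors="replace"))
    return finder.charset


print(meta_charset(b'<html><head><meta charset="UTF-8"></head></html>'))
```

<p>A real HTML parser applies more elaborate sniffing rules than this, so treat the result as a hint rather than a guarantee.</p>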
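<p>If the HTML page contains such a header naming a different codec, you should <strong>not</strong> simply re-encode it to UTF-8; at the very least you must correct that header as well.</p>
<p>Using BeautifulSoup:</p>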
<pre><code>from bs4 import BeautifulSoup

# pass in explicit encoding if set as a header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
soup = BeautifulSoup(content, 'html.parser', from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8; without an encoding argument prettify() returns str
    content = soup.prettify(encoding='utf-8')
</code></pre>
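<p>Similarly, other document standards may also specify specific encodings; XML, for example, is always UTF-8 unless specified otherwise by a <code>&lt;?xml encoding="..." ... ?&gt;</code> XML declaration, which is again part of the document.</p>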
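<p>For XML, the same idea can be sketched with a regular expression over the first bytes of the document. This is a simplified illustration: the helper <code>xml_declared_encoding</code> is hypothetical, it assumes an ASCII-compatible encoding, and it ignores byte order marks and UTF-16 documents.</p>

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(xml_bytes):
    """Return the declared encoding, or 'utf-8' when none is declared."""
    match = XML_DECL.match(xml_bytes)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><r/>'))
print(xml_declared_encoding(b'<?xml version="1.0"?><r/>'))
```

<p>In practice an XML parser does this detection for you when handed bytes, so a helper like this is mainly useful for inspecting what a document claims about itself.</p>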