<p>Thank you very much for your answers, John and Steven. They got me thinking in a different direction, which led me to the root of the problem and also to a working solution.</p>
<p>I was using the following test code:</p>
<pre><code>import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

# Fetch the raw page bytes with urllib2
url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)
handle = url_handler.open(URL)
response = handle.read()
handle.close()

# Build an empty HtmlResponse, then swap in the real body
html_response = HtmlResponse(URL).replace(body=response)  # Problematic line
hxs = HtmlXPathSelector(html_response)
desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')
</code></pre>
<p>In the scrapy shell, when I extracted the description data, it came out fine. That gave me good reason to suspect something in my own code, because at the <code>pdb</code> prompt I could see replacement characters in the extracted data.</p>
<p>I skimmed the sparse documentation for the <a href="http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response" rel="nofollow">Response class</a> and adjusted the code above to:</p>
<pre><code>import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

# Fetch the raw page bytes with urllib2
url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)
handle = url_handler.open(URL)
response = handle.read()
handle.close()

# Pass the body at construction time instead of via replace()
#html_response = HtmlResponse(URL).replace(body=response)
html_response = HtmlResponse(URL, body=response)
hxs = HtmlXPathSelector(html_response)
desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')
</code></pre>
<p>The change I made was to replace the line <code>html_response = HtmlResponse(URL).replace(body=response)</code> with <code>html_response = HtmlResponse(URL, body=response)</code>. My understanding is that the <code>replace()</code> method somehow mangles the special characters from an encoding point of view.</p>
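<p>My guess — unverified against Scrapy's actual source — is that the empty response already resolves to a default encoding, and <code>replace()</code> carries that already-resolved encoding over to the copy, so the <code>&lt;meta charset&gt;</code> of the real body is never inspected. A simplified model of that suspicion (the <code>FakeResponse</code> class is mine, not Scrapy's):</p>
<pre><code># -*- coding: utf-8 -*-
# Simplified model of my suspicion, NOT Scrapy's real code: an empty
# response falls back to a default encoding, and replace() pins that
# encoding on the copy, so the new body's charset is ignored.
import re


class FakeResponse(object):
    def __init__(self, url, body='', encoding=None):
        self.url = url
        self.body = body
        self._declared = encoding  # explicitly pinned encoding, if any

    @property
    def encoding(self):
        if self._declared:
            return self._declared
        # sniff a charset declaration from the body, else fall back
        m = re.search(r'charset=([\w-]+)', self.body)
        return m.group(1) if m else 'ascii'

    def replace(self, body=None):
        # like a generic replace(): unspecified attributes are copied
        # from self -- including the already-resolved encoding
        return FakeResponse(self.url,
                            body if body is not None else self.body,
                            encoding=self.encoding)


body = '&lt;meta charset=utf-8&gt;&lt;span id="attribute-content"&gt;J\xc3\xbcrgen&lt;/span&gt;'

good = FakeResponse("http://example.com", body=body)
bad = FakeResponse("http://example.com").replace(body=body)

print(good.encoding)  # utf-8: sniffed from the body
print(bad.encoding)   # ascii: resolved by replace() before a body existed
</code></pre>
<p>If this model is right, it would explain why constructing the response with the body in place behaves correctly: encoding detection then runs against the real body.</p>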
<p>If anyone can shed light on exactly what the <code>replace()</code> method is doing wrong, I would greatly appreciate the effort.</p>
<p>Thanks again.</p>