<p>Thank you very much for your answers, John and Steven. They got me thinking in a different direction, which led me to the root of the problem and also to a working solution.</p>
<p>I was using the following test code:</p>
<pre><code>import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

# Fetch the raw page bytes with urllib2
url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)
handle = url_handler.open(URL)
response = handle.read()
handle.close()

# Build an empty HtmlResponse, then swap in the real body
html_response = HtmlResponse(URL).replace(body=response)  # Problematic line
hxs = HtmlXPathSelector(html_response)
desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')
</code></pre>
<p>In the scrapy shell, when I extracted the description data, it came out fine. That gave me good reason to suspect something in my own code, because at the <code>pdb</code> prompt I could see replacement characters in the extracted data.</p>
<p>I skimmed the sparse documentation for the <a href="http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response" rel="nofollow">Response class</a> and adjusted the code above to:</p>
<pre><code>import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

# Fetch the raw page bytes with urllib2
url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)
handle = url_handler.open(URL)
response = handle.read()
handle.close()

# Pass the body at construction time instead of via replace()
#html_response = HtmlResponse(URL).replace(body=response)
html_response = HtmlResponse(URL, body=response)
hxs = HtmlXPathSelector(html_response)
desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')
</code></pre>
<p>The change I made was to replace the line <code>html_response = HtmlResponse(URL).replace(body=response)</code> with <code>html_response = HtmlResponse(URL, body=response)</code>. My understanding is that the <code>replace()</code> method somehow mangles the special characters from an encoding point of view.</p>
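<p>My guess — unverified against Scrapy's actual source — is that the empty response already resolves to a default encoding, and <code>replace()</code> carries that already-resolved encoding over to the copy, so the <code>&lt;meta charset&gt;</code> of the real body is never inspected. A simplified model of that suspicion (the <code>FakeResponse</code> class is mine, not Scrapy's):</p>
<pre><code># -*- coding: utf-8 -*-
# Simplified model of my suspicion, NOT Scrapy's real code: an empty
# response falls back to a default encoding, and replace() pins that
# encoding on the copy, so the new body's charset is ignored.
import re


class FakeResponse(object):
    def __init__(self, url, body='', encoding=None):
        self.url = url
        self.body = body
        self._declared = encoding  # explicitly pinned encoding, if any

    @property
    def encoding(self):
        if self._declared:
            return self._declared
        # sniff a charset declaration from the body, else fall back
        m = re.search(r'charset=([\w-]+)', self.body)
        return m.group(1) if m else 'ascii'

    def replace(self, body=None):
        # like a generic replace(): unspecified attributes are copied
        # from self -- including the already-resolved encoding
        return FakeResponse(self.url,
                            body if body is not None else self.body,
                            encoding=self.encoding)


body = '&lt;meta charset=utf-8&gt;&lt;span id="attribute-content"&gt;J\xc3\xbcrgen&lt;/span&gt;'

good = FakeResponse("http://example.com", body=body)
bad = FakeResponse("http://example.com").replace(body=body)

print(good.encoding)  # utf-8: sniffed from the body
print(bad.encoding)   # ascii: resolved by replace() before a body existed
</code></pre>
<p>If this model is right, it would explain why constructing the response with the body in place behaves correctly: encoding detection then runs against the real body.</p>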
<p>If anyone can shed light on exactly what the <code>replace()</code> method is doing wrong, I would greatly appreciate the effort.</p>
<p>Thanks again.</p>