Unicode and UTF-8 encoding issue with Scrapy XPath selector text

3 answers

User

#1 · edited 2024-09-28 20:52:41

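Thank you very much for your answers, John and Steven. They made me look at the problem from a different angle, which led me to its root cause and to a working fix.

I was using the following test code: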

import urllib
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)

handle = url_handler.open(URL)
response = handle.read()
handle.close()

html_response = HtmlResponse(URL).replace(body=response) # Problematic line
hxs = HtmlXPathSelector(html_response)

desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')

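In the Scrapy shell, extracting the description data gave a correct result. That made me suspect my own code, because at the pdb prompt I was seeing replacement characters in the extracted data.

I went through the Scrapy documentation for the Response class and adjusted the code above to: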

import urllib
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)

handle = url_handler.open(URL)
response = handle.read()
handle.close()

#html_response = HtmlResponse(URL).replace(body=response)
html_response = HtmlResponse(URL, body=response)
hxs = HtmlXPathSelector(html_response)

desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')

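The change I made was to replace the line html_response = HtmlResponse(URL).replace(body=response) with html_response = HtmlResponse(URL, body=response). My understanding is that the replace() method somehow mangles the special characters as far as encoding is concerned.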
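One mechanism that would produce this symptom can be sketched with a toy class. This is only an illustration under stated assumptions: ToyResponse, its _sniff() helper, and the ASCII fallback are hypothetical and are not Scrapy's actual implementation. The sketch assumes the encoding is resolved once, when the object is built, and that a copy-with-changes method carries that resolved encoding over to the new object.

```python
class ToyResponse:
    """Hypothetical stand-in for a text response; NOT Scrapy's real class."""

    def __init__(self, url, body=b"", encoding=None):
        self.url = url
        self.body = body
        # Assumption: the encoding is decided once, at construction time.
        self.encoding = encoding or self._sniff(body)

    @staticmethod
    def _sniff(body):
        # Toy detection: pure-ASCII (including empty) bodies are "ascii",
        # anything else is assumed to be UTF-8.
        try:
            body.decode("ascii")
            return "ascii"
        except UnicodeDecodeError:
            return "utf-8"

    def replace(self, **changes):
        # Copy-with-changes: the already-resolved encoding is carried over,
        # so the new body is never inspected.
        changes.setdefault("encoding", self.encoding)
        changes.setdefault("body", self.body)
        return ToyResponse(self.url, **changes)

    def text(self):
        return self.body.decode(self.encoding, "replace")


page = "H\u00fcftsitz".encode("utf-8")  # "Hueftsitz" spelled with a u-umlaut

# Built empty first, then given a body: the encoding was fixed while empty.
via_replace = ToyResponse("http://example.invalid/").replace(body=page)

# Built with the body: the encoding is detected from the real bytes.
direct = ToyResponse("http://example.invalid/", body=page)

print(repr(via_replace.text()))
print(repr(direct.text()))
```

In this toy model the two construction paths differ only in when the encoding gets decided, which matches the observation that passing body= to the constructor fixed the output.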

If anyone can provide details on what exactly the replace() method does wrong, I would greatly appreciate the effort.

Thanks again.

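User

#2 · edited 2024-09-28 20:52:41

u'\ufffd' is the "Unicode replacement character", usually rendered as a question mark inside a black diamond. It is not a u-umlaut, so the problem must be upstream. Check which encoding the headers of the returned web page declare, and verify that the content really is in that encoding.

The Unicode replacement character is usually inserted in place of an illegal or unrecognized character. That can have several causes, but the most likely one is that the encoding is not what it claims to be.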
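A minimal sketch of that check, using only the Python 3 standard library. The helper names declared_charset() and decodes_cleanly() and the sample header value are made up for illustration; in practice the Content-Type value and the body bytes would come from the actual HTTP response.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decodes_cleanly(body, encoding):
    """True if the bytes decode under `encoding` in strict mode."""
    try:
        body.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Hypothetical response: the header claims ASCII, the body is really UTF-8.
body = "H\u00fcftsitz".encode("utf-8")
claimed = declared_charset("text/html; charset=us-ascii")

print(claimed, decodes_cleanly(body, claimed))
print("utf-8", decodes_cleanly(body, "utf-8"))
```

A clean strict decode is necessary but not sufficient: single-byte encodings such as ISO-8859-1 accept any byte sequence, so a pass under one of those proves very little.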

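User

#3 · edited 2024-09-28 20:52:41

U+FFFD is the replacement character you get when you do some_bytes.decode('some-encoding', 'replace') and some substring of some_bytes cannot be decoded.

You have two of them: u'H\ufffd\ufffdftsitz'. This indicates that the u-umlaut was represented as two bytes, each of which failed to decode. Most likely the site is encoded in UTF-8 but the software is trying to decode it as ASCII. Attempts to decode as ASCII usually happen on an implicit conversion to Unicode, where ASCII is used as the default encoding. In that case, however, one would not expect the 'replace' argument to be used. More likely the code takes an encoding as a parameter and was written by someone who thinks that "doesn't raise an exception" means the same as "works".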
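The effect is easy to reproduce in isolation. The snippet below is a Python 3 sketch (the thread's own code is Python 2); it assumes the original word contains a u-umlaut, as inferred above, and decodes its UTF-8 bytes three ways.

```python
# UTF-8 encodes u-umlaut (U+00FC) as the two bytes 0xC3 0xBC.
raw = "H\u00fcftsitz".encode("utf-8")

# Wrong codec plus 'replace': each undecodable byte becomes one U+FFFD,
# and no exception is raised.
lossy = raw.decode("ascii", "replace")

# Wrong codec in strict mode (the default): the error is reported.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Right codec: the original text comes back.
correct = raw.decode("utf-8")

print(repr(raw), repr(lossy), strict_failed, repr(correct))
```

The 'replace' handler hides the mismatch instead of reporting it, which is why the damage only shows up later in the extracted text.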

Please edit your question to provide the URL and to show the minimal code that produces u'H\ufffd\ufffdftsitz'.
