How do I convert the unicode returned by Scrapy into a string?

Posted 2024-10-03 23:20:58


When I make a request to a URL from the scrapy shell, I get back something like this:

In [6]: sel.xpath("//div[@class='my_class']").extract()
 [u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440\u04....

How can I convert this into a readable string?


2 answers

A few remarks:

  • sel.xpath("//div[@class='my_class']") selects the div elements.

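  • sel.xpath("//div[@class='my_class']").extract() returns the string representation of the selected elements: HTML, as a list. If text nodes inside the selection contain unicode code points, the unicode content is displayed as \uXXXX escape sequences.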

You can also ask for the string representation of the selected node directly, using XPath's string() function:

  • sel.xpath("string(//div[@class='my_class'])").extract()

  • or use the common pattern of joining the text() nodes into one string: "".join(sel.xpath("//div[@class='my_class']//text()").extract())

Note that string() only considers the first element matched by the expression passed as its argument. From the XPath 1.0 specification:

A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order.
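
The difference between the two approaches only shows up when more than one element matches. Here is a minimal sketch of that behaviour, assuming Python 3 and using the standard library's xml.etree.ElementTree to emulate the XPath semantics rather than Scrapy itself; the two-div document and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with TWO elements that match the class.
DOC = (
    "<root>"
    '<div class="my_class"><a>\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b</a></div>'
    '<div class="my_class"><a>\u0420\u0410\u0419\u0414\u0415\u0420\u042b</a></div>'
    "</root>"
)

root = ET.fromstring(DOC)
matches = root.findall(".//div[@class='my_class']")

# Emulates string(//div[@class='my_class']):
# the string-value of the first match in document order only.
first_only = "".join(matches[0].itertext())

# Emulates "".join(sel.xpath("//div[@class='my_class']//text()").extract()):
# the text nodes under every match, concatenated.
all_text = "".join(text for div in matches for text in div.itertext())

print(first_only)
print(all_text)
```

On a UTF-8 terminal the first line printed is ТРАКТОРЫ and the second is ТРАКТОРЫРАЙДЕРЫ, so string() is the safer choice only when you expect a single match.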


Example scrapy shell session:

$ scrapy shell
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f06700bc2d0>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x7f06700b6f10>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import scrapy

In [2]: sel = scrapy.Selector(text=u'''<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440''')

In [3]: print("".join(sel.xpath('//div[@class="my_class"]//text()').extract()))


ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор

In [4]: for r in sel.xpath('string(//div[@class="my_class"])').extract():
   ...:     print(r)
   ...:     


ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор

In [5]: 

Once you print it (or write it to a file), it will be readable:

>>> u = u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440'
>>> print (u)
<div class="my_class"><ul><li class="parent">
<a href="/category/tractors-ride-on-mowers/">
ТРАКТОРЫ и РАЙДЕРЫ</a>
<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">Садовые трактор
>>> 
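
The same point can be sketched in Python 3 (an assumption, since the sessions in this thread are Python 2): the escapes are only how the interpreter displays the value, not what the string contains. The file name menu.txt is hypothetical.

```python
import os
import tempfile

# Same characters as in the question's output, written with escapes.
text = "\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b"

# ascii() gives the escaped, developer-facing view of the string.
escaped = ascii(text)
print(escaped)

# Printing the string itself shows readable Cyrillic on a UTF-8 terminal.
print(text)

# Writing with an explicit encoding keeps the text readable on disk too.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "menu.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(path, encoding="utf-8") as handle:
        read_back = handle.read()
```

No conversion step is involved: text holds the Cyrillic characters all along, and read_back is equal to it.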
