Extracting data from a gsmarena page using Scrapy

from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # extract whatever stuffs you want and yield items here
        hxs = Selector(response)
        phone = gsmArenaDataItem()
        tableRows = hxs.css("div#specs-list table")
        for tableRows in tableRows:
            phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
            for ttl in tableRows.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                colonSign = ": "
                commaSign = ", "
                seq = [ttl_value, colonSign, nfo_value, commaSign]
                phone['phoneDetails'] = "".join(seq)
                yield phone

2 Answers

User

#1 · Edited 2024-06-26 02:08:03

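I am facing the same problem too: I get banned after a few requests. Using scrapy-proxies and changing proxies with {a2} helped a lot, but did not solve the problem completely.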

You can find my code at gsmarenacrawler.
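
For readers unfamiliar with proxy rotation, here is a minimal sketch of the general idea. It is not the scrapy-proxies implementation and not the code from gsmarenacrawler. The class name `RandomProxyMiddleware`, the `PROXIES` list, and the `FakeRequest` stand-in are all hypothetical, and the proxy URLs are placeholders. The only Scrapy behavior assumed is that the built-in HttpProxyMiddleware honors `request.meta["proxy"]`.

```python
import random

# Hypothetical proxy endpoints (placeholders, not real servers).
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


class RandomProxyMiddleware:
    """Downloader middleware sketch: assign a random proxy to every request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets the request continue through the middleware chain.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self):
        self.meta = {}


middleware = RandomProxyMiddleware(PROXIES)
request = FakeRequest()
result = middleware.process_request(request, spider=None)
print(request.meta["proxy"])
```

In a real project the class would still need to be registered in `DOWNLOADER_MIDDLEWARES` and given a way to receive its proxy list (for example via a `from_crawler` classmethod); that wiring is left out here. As the answer notes, rotating proxies may reduce bans without eliminating them.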

User

#2 · Edited 2024-06-26 02:08:03

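The idea is to iterate over all table elements inside "specs-list", get the th element for the block name, then get every td element with class="ttl" along with its corresponding following td sibling with class="nfo".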

Demo from the shell:

In [1]: for scope in response.css("div#specs-list table"):
   ...:     scope_name = scope.xpath(".//th/text()").extract()[0]
   ...:     for ttl in scope.xpath(".//td[@class='ttl']"):
   ...:         ttl_value = " ".join(ttl.xpath(".//text()").extract())
   ...:         nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
   ...:         print(scope_name, ttl_value, nfo_value)
   ...:
Network Technology GSM / HSPA / LTE
Network 2G bands GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2
...
Battery Stand-by Up to 598 h (2G) / Up to 626 h (3G)
Battery Talk time Up to 23 h (2G) / Up to 13 h (3G)
Misc Colors Meteor Grey, Rose Gold, Gold Sepia
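
The demo prints each row. To keep every row instead of only the last one, the triples can be grouped into a nested dictionary. The following is a minimal sketch under stated assumptions: `group_specs` is a hypothetical helper, and the sample `rows` are copied from the output above, split on the assumption that the first word is the block name. In a spider, the triples would come from the same loop as the demo.

```python
def group_specs(rows):
    """Group (scope_name, ttl_value, nfo_value) triples by block name."""
    specs = {}
    for scope_name, ttl_value, nfo_value in rows:
        # Strip stray whitespace left over from joining text nodes.
        block = specs.setdefault(scope_name.strip(), {})
        block[ttl_value.strip()] = nfo_value.strip()
    return specs


# Sample triples taken from the demo output above.
rows = [
    ("Network", "Technology", "GSM / HSPA / LTE"),
    ("Network", "2G bands", "GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2"),
    ("Battery", "Stand-by", "Up to 598 h (2G) / Up to 626 h (3G)"),
    ("Battery", "Talk time", "Up to 23 h (2G) / Up to 13 h (3G)"),
    ("Misc", "Colors", "Meteor Grey, Rose Gold, Gold Sepia"),
]

specs = group_specs(rows)
print(specs["Battery"]["Talk time"])
```

One caveat of this design: if a label repeats within the same block, the later value replaces the earlier one, so a list per label may be a better fit for pages where that happens.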
