I am crawling www.extratorrent.cc with Scrapy. Here is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from extra.items import *

class extraSpider(CrawlSpider):
    name = 'extraSpider'
    allowed_domains = ['extratorrent.cc']
    start_urls = ['http://www.extratorrent.cc/torrent']
    rules = [Rule(LinkExtractor(allow=['/\d+/\S+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[2]/table[2]/tbody/tr/td[2]/h1").extract()
        torrent['description'] = response.xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[2]/div[4]").extract()
        torrent['size'] = response.xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td[1]/table/tbody/tr[10]/td[2]").extract()
        return torrent
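In the generated JSON file, only the url field is populated. The other fields (name, description, and size) come back empty.
I can't tell where I went wrong. I have tried changing the XPath expressions, but to no avail. There must be some small thing I am missing.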
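One possible cause, offered as an assumption rather than a confirmed diagnosis: browsers add `<tbody>` elements to tables when they build the DOM, so an XPath copied from the browser's inspector can contain `/tbody` steps that do not exist in the raw HTML Scrapy downloads. The sketch below illustrates that effect with a hypothetical, simplified page and the standard library's `xml.etree.ElementTree` standing in for Scrapy's selectors. The markup and the heading text are invented for illustration and are not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw page source.
# It contains no <tbody> element, as is common in hand-written HTML.
RAW_HTML = """
<html>
  <body>
    <table>
      <tr><td><h1>Example torrent name</h1></td></tr>
    </table>
  </body>
</html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as a browser inspector might report it, including the tbody step.
with_tbody = root.findall("./body/table/tbody/tr/td/h1")

# Same path with the tbody step removed, matching the raw source.
without_tbody = root.findall("./body/table/tr/td/h1")

print(len(with_tbody))         # no match, because the raw source has no tbody
print(without_tbody[0].text)   # the heading is found once tbody is dropped
```

If the same thing is happening in the spider, removing the `/tbody` steps from each `response.xpath(...)` expression, or switching to a shorter relative expression, would be the analogous change. Whether that is enough depends on the actual page markup, which is not shown here.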
I have made some changes to the code, which might help.
I have attached some sample output here.