How can I iterate over pages with Scrapy? I want to extract the body of every page matching http://www.saylor.org/site/syllabus.php?cid=NUMBER, where NUMBER runs from 1 to about 400.
I wrote this spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from syllabi.items import SyllabiItem

class SyllabiSpider(CrawlSpider):
    name = 'saylor'
    allowed_domains = ['saylor.org']
    start_urls = ['http://www.saylor.org/site/syllabus.php?cid=']
    rules = [Rule(SgmlLinkExtractor(allow=['\d+']), 'parse_syllabi')]

    def parse_syllabi(self, response):
        x = HtmlXPathSelector(response)
        syllabi = SyllabiItem()
        syllabi['url'] = response.url
        syllabi['body'] = x.select("/html/body/text()").extract()
        return syllabi
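But it doesn't work. I understand that it looks for links on the start URL, which isn't really what I want it to do. I want to iterate over those pages themselves. Does that make sense?
Thanks for your help.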
Try this:
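Since the course ids are known in advance, one option is to skip link extraction and generate every URL up front, so that each page is requested directly. The following is a minimal sketch, not a tested crawl of the site. It assumes a reasonably recent Scrapy release in which `scrapy.Spider` and `response.xpath` are available (in the older release used in the question, `BaseSpider` and `HtmlXPathSelector` play the same roles). The helper `syllabus_urls`, the constant `BASE_URL`, and the upper bound of 400 are illustrative assumptions, and a plain dict stands in for `SyllabiItem` to keep the example self-contained.

```python
try:
    from scrapy import Spider
except ImportError:
    # Fallback so the URL list can be inspected without Scrapy installed;
    # an actual crawl still requires Scrapy.
    Spider = object

# URL pattern taken from the question; %d is filled with the course id.
BASE_URL = 'http://www.saylor.org/site/syllabus.php?cid=%d'


def syllabus_urls(first=1, last=400):
    """Return one syllabus URL per course id from first to last, inclusive.

    Hypothetical helper; the default range (1 to 400) is an assumption
    based on the "about 400" mentioned in the question.
    """
    return [BASE_URL % cid for cid in range(first, last + 1)]


class SyllabiSpider(Spider):
    name = 'saylor'
    allowed_domains = ['saylor.org']
    # Every page is listed explicitly, so no Rule or link extractor is needed.
    start_urls = syllabus_urls()

    def parse(self, response):
        # parse() is the default callback for each URL in start_urls.
        # A plain dict is yielded here in place of SyllabiItem.
        yield {
            'url': response.url,
            'body': response.xpath('/html/body/text()').extract(),
        }
```

With this layout, `CrawlSpider` and its rules are not needed at all: a rule only follows links found on the start page, whereas here every target page is itself a start URL. Ids that do not correspond to a real course would presumably return an error or an empty page, so some filtering in `parse` may still be needed.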