Iterating over a site with Python Scrapy

Posted 2024-09-30 22:16:48

How can I iterate over a site with Scrapy? I want to extract the body of every page matching http://www.saylor.org/site/syllabus.php?cid=NUMBER, where NUMBER runs from 1 to about 400.

I wrote this spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from syllabi.items import SyllabiItem

class SyllabiSpider(CrawlSpider):

    name = 'saylor'
    allowed_domains = ['saylor.org']
    start_urls = ['http://www.saylor.org/site/syllabus.php?cid=']
    rules = [Rule(SgmlLinkExtractor(allow=['\d+']), 'parse_syllabi')]

    def parse_syllabi(self, response):
        x = HtmlXPathSelector(response)

        syllabi = SyllabiItem()
        syllabi['url'] = response.url
        syllabi['body'] = x.select("/html/body/text()").extract()
        return syllabi

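But it doesn't work. I understand that it looks for links on the start URL, which isn't really what I want it to do. I want to iterate over these pages instead. Does that make sense?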

Thanks for your help.


1 Answer
User
#1 · Posted 2024-09-30 22:16:48

Try this:

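import scrapy
from syllabi.items import SyllabiItem

# scrapy.Spider is the current name of the older scrapy.spider.BaseSpider.
class SyllabiSpider(scrapy.Spider):
    name = 'saylor'
    allowed_domains = ['saylor.org']
    max_cid = 400  # highest cid value to request

    def start_requests(self):
        # Build each URL directly instead of following links.
        # cid runs from 1 to max_cid inclusive, as asked in the question.
        for i in range(1, self.max_cid + 1):
            yield scrapy.Request('http://www.saylor.org/site/syllabus.php?cid=%d' % i,
                                 callback=self.parse_syllabi)

    def parse_syllabi(self, response):
        # Store the URL and the raw response body of each syllabus page.
        syllabi = SyllabiItem()
        syllabi['url'] = response.url
        syllabi['body'] = response.body

        return syllabi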
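
The URLs requested by the spider can also be built up front as a plain list, which makes the range easy to check without running a crawl. This is only a minimal sketch: `BASE_URL` and `syllabus_urls` are hypothetical names introduced here, and the inclusive range 1 to 400 is taken from the question.

```python
# Hypothetical helper, not part of Scrapy: builds the syllabus URLs
# for every cid from `first` to `last`, inclusive.
BASE_URL = 'http://www.saylor.org/site/syllabus.php?cid=%d'


def syllabus_urls(first=1, last=400):
    """Return the list of syllabus URLs for cid values first..last."""
    return [BASE_URL % cid for cid in range(first, last + 1)]


if __name__ == '__main__':
    urls = syllabus_urls()
    print(len(urls))   # number of pages that would be requested
    print(urls[0])     # first URL in the range
    print(urls[-1])    # last URL in the range
```

Assuming the default callback is acceptable, a list like this could be assigned to the spider's `start_urls` attribute instead of overriding `start_requests`; the responses would then be passed to a method named `parse` rather than `parse_syllabi`.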
