Scrapy parses only part of a site and ignores the rest


When I run the scraper, it scrapes roughly 200 records from a site that contains about 250. I cannot figure out what mistake I made while writing it. Any help would be appreciated.

My items.py contains:

import scrapy
class WiseowlItem(scrapy.Item):
    Name = scrapy.Field()
    Url= scrapy.Field()

The crawl spider, named wiseowlsp.py, contains:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"
    allowed_domains = ['www.wiseowl.co.uk']
    start_urls = ['http://www.wiseowl.co.uk/videos/']
    rules = [Rule(LinkExtractor(restrict_xpaths='//li[@class="woMenuItem"]')),
             Rule(LinkExtractor(restrict_xpaths='//div[@class="woPaging tac"]'),
                  callback='parse_items')]

    def parse_items(self, response):
        page = response.xpath('//div[@class="woVideoListRow"]')
        for title in page:
            AA = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            BB = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield {'Name':AA,'Url':BB}

If I use the style pasted below, I get the results I want, but I would like to avoid using a regex.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from wiseowl.items import WiseowlItem

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"
    allowed_domains = ["wiseowl.co.uk"]
    start_urls = ['http://www.wiseowl.co.uk/videos/']
    rules = [Rule(LinkExtractor(allow=('uk/videos/.*')),callback='parse_items', follow=True)]

    def parse_items(self, response):
        page = response.xpath('//div[@class="woVideoListRow"]')
        for title in page:
            item=WiseowlItem()
            item["Name"] = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            item["Url"] = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield item

With this setup, restrict_xpaths always skips the first page of each category and only starts scraping from the next page onwards until it runs out. I believe there must be some way (some restriction within this restrict_xpaths pattern) so that the data on the first page is scraped as well. Hoping someone can give me a push in the right direction.
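A minimal sketch of one possible adjustment (an untested assumption on my part, not code from the original post): CrawlSpider only runs a callback on pages reached through a Rule that defines one, so the category landing pages matched by the menu Rule are followed but never parsed. Giving that Rule the same callback together with follow=True should let the first page of each category be scraped as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"
    allowed_domains = ['www.wiseowl.co.uk']
    start_urls = ['http://www.wiseowl.co.uk/videos/']
    rules = [
        # Category menu links: now parsed AND followed (assumption, untested)
        Rule(LinkExtractor(restrict_xpaths='//li[@class="woMenuItem"]'),
             callback='parse_items', follow=True),
        # Pagination links inside each category
        Rule(LinkExtractor(restrict_xpaths='//div[@class="woPaging tac"]'),
             callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        for title in response.xpath('//div[@class="woVideoListRow"]'):
            AA = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            BB = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield {'Name': AA, 'Url': BB}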


1 answer

I hate using the typical Rule with LinkExtractor; it is hard to follow because Scrapy does everything on its own behind the scenes.

I always prefer to use the start_requests method, which is the entry point of your spider.

For the website you are scraping, I would first work out the logic in my head and then translate it into code:

  1. Go to the home page
  2. Go to each category page in the left-hand menu
  3. Scrape every item on each page
  4. If there is a "next page" link, go to the next page

Here is 100% working code.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http.request import Request
import logging

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"

    def start_requests(self):
        # Go to the home page first
        yield Request(url = "http://www.wiseowl.co.uk/videos/", callback = self.parse_home_page)


    def parse_home_page(self, response):
        # Parse every category link in the left-hand menu and request each category page
        for cat in response.css(".woMenuList > li"):
            logging.info("\n\n\nScraping Category: %s" % (cat.css("a::text").extract_first()))
            yield Request(url = "http://www.wiseowl.co.uk" + cat.css("a::attr(href)").extract_first(), callback = self.parse_listing_page)


    def parse_listing_page(self, response):
        # Scrape every video row on the current listing page
        items = response.xpath('//div[@class="woVideoListRow"]')
        for title in items:
            AA = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            BB = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield {'Name':AA,'Url':BB}

        next_page = response.css("a.woPagingNext::attr(href)").extract_first()

        if next_page is not None:
            logging.info("\n\n\nGoing to next page %s" % (next_page))
            # If there is a next page, scrape it too
            yield Request(url = "http://www.wiseowl.co.uk" + next_page, callback = self.parse_listing_page)
        else:
            # No "next" arrow, so fall back to the numbered paging links
            for more_pages in response.css("a.woPagingItem"):
                next_page = more_pages.css("::attr(href)").extract_first()
                logging.info("\n\n\nGoing to next page %s" % (next_page))
                yield Request(url = "http://www.wiseowl.co.uk" + next_page, callback = self.parse_listing_page)
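As a side note on the pagination design (a hedged sketch that assumes Scrapy 1.4 or newer, not part of the original answer): response.follow resolves relative hrefs itself, so the manual domain prefix and the extract_first() call can be dropped. It would be a drop-in replacement for parse_listing_page above:

    def parse_listing_page(self, response):
        # Same scraping logic as above, rewritten with response.follow (assumes Scrapy >= 1.4)
        for title in response.xpath('//div[@class="woVideoListRow"]'):
            yield {
                'Name': title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract(),
                'Url': title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract(),
            }
        next_link = response.css("a.woPagingNext")
        if next_link:
            # response.follow accepts the <a> selector directly and builds an absolute URL
            yield response.follow(next_link[0], callback=self.parse_listing_page)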

In settings.py, add the following:

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
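If you prefer not to edit the project-wide settings.py, the same two overrides can be scoped to this one spider through Scrapy's custom_settings class attribute; a minimal sketch under that assumption (my addition, not part of the original answer):

from scrapy.contrib.spiders import CrawlSpider

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"

    # Per-spider equivalent of the settings.py snippet above
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
        },
    }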

Now you can see that my code reads easily from top to bottom and you can follow its logic.
