Scrapy parses only part of a site and ignores the rest


When I run the scraper, it scrapes roughly 200 records from a site that contains about 250. I cannot figure out what mistake I made while writing it. Any help would be appreciated.

My items.py contains:

import scrapy
class WiseowlItem(scrapy.Item):
    Name = scrapy.Field()
    Url= scrapy.Field()

The crawl spider, named wiseowlsp.py, contains:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"
    allowed_domains = ['www.wiseowl.co.uk']
    start_urls = ['http://www.wiseowl.co.uk/videos/']
    rules = [Rule(LinkExtractor(restrict_xpaths='//li[@class="woMenuItem"]')),
             Rule(LinkExtractor(restrict_xpaths='//div[@class="woPaging tac"]'),
                  callback='parse_items')]

    def parse_items(self, response):
        page = response.xpath('//div[@class="woVideoListRow"]')
        for title in page:
            AA = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            BB = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield {'Name':AA,'Url':BB}

If I use the style pasted below, I get the results I want, but I would like to avoid using a regex.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from wiseowl.items import WiseowlItem

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"
    allowed_domains = ["wiseowl.co.uk"]
    start_urls = ['http://www.wiseowl.co.uk/videos/']
    rules = [Rule(LinkExtractor(allow=('uk/videos/.*')),callback='parse_items', follow=True)]

    def parse_items(self, response):
        page = response.xpath('//div[@class="woVideoListRow"]')
        for title in page:
            item=WiseowlItem()
            item["Name"] = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            item["Url"] = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield item

With this setup, restrict_xpaths always skips the first page of each category and only starts scraping from the next page onwards until it runs out. I believe there must be some way (some restriction within this restrict_xpaths pattern) so that the data on the first page is scraped as well. Hoping someone can give me a push in the right direction.
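A minimal sketch of one possible adjustment (an untested assumption on my part, not code from the original post): CrawlSpider only runs a callback on pages reached through a Rule that defines one, so the category landing pages matched by the menu Rule are followed but never parsed. Giving that Rule the same callback together with follow=True should let the first page of each category be scraped as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"
    allowed_domains = ['www.wiseowl.co.uk']
    start_urls = ['http://www.wiseowl.co.uk/videos/']
    rules = [
        # Category menu links: now parsed AND followed (assumption, untested)
        Rule(LinkExtractor(restrict_xpaths='//li[@class="woMenuItem"]'),
             callback='parse_items', follow=True),
        # Pagination links inside each category
        Rule(LinkExtractor(restrict_xpaths='//div[@class="woPaging tac"]'),
             callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        for title in response.xpath('//div[@class="woVideoListRow"]'):
            AA = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            BB = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield {'Name': AA, 'Url': BB}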


1 answer

I hate using the typical Rule with LinkExtractor; it is hard to follow because Scrapy does everything on its own behind the scenes.

I always prefer to use the start_requests method, which is the entry point of your spider.

For the website you are scraping, I would first work out the logic in my head and then translate it into code:

  1. Go to the home page
  2. Go to each category page in the left-hand menu
  3. Scrape every item on each page
  4. If there is a "next page" link, go to the next page

Here is 100% working code.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http.request import Request
import logging

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"

    def start_requests(self):
        # Go to the home page first
        yield Request(url = "http://www.wiseowl.co.uk/videos/", callback = self.parse_home_page)


    def parse_home_page(self, response):
        # Parse every category link in the left-hand menu and request each category page
        for cat in response.css(".woMenuList > li"):
            logging.info("\n\n\nScraping Category: %s" % (cat.css("a::text").extract_first()))
            yield Request(url = "http://www.wiseowl.co.uk" + cat.css("a::attr(href)").extract_first(), callback = self.parse_listing_page)


    def parse_listing_page(self, response):
        # Scrape every video row on the current listing page
        items = response.xpath('//div[@class="woVideoListRow"]')
        for title in items:
            AA = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract()
            BB = title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract()
            yield {'Name':AA,'Url':BB}

        next_page = response.css("a.woPagingNext::attr(href)").extract_first()

        if next_page is not None:
            logging.info("\n\n\nGoing to next page %s" % (next_page))
            # If there is a next page, scrape it too
            yield Request(url = "http://www.wiseowl.co.uk" + next_page, callback = self.parse_listing_page)
        else:
            # No "next" arrow, so fall back to the numbered paging links
            for more_pages in response.css("a.woPagingItem"):
                next_page = more_pages.css("::attr(href)").extract_first()
                logging.info("\n\n\nGoing to next page %s" % (next_page))
                yield Request(url = "http://www.wiseowl.co.uk" + next_page, callback = self.parse_listing_page)
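As a side note on the pagination design (a hedged sketch that assumes Scrapy 1.4 or newer, not part of the original answer): response.follow resolves relative hrefs itself, so the manual domain prefix and the extract_first() call can be dropped. It would be a drop-in replacement for parse_listing_page above:

    def parse_listing_page(self, response):
        # Same scraping logic as above, rewritten with response.follow (assumes Scrapy >= 1.4)
        for title in response.xpath('//div[@class="woVideoListRow"]'):
            yield {
                'Name': title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/text()').extract(),
                'Url': title.xpath('.//p[@class="woVideoListDefaultSeriesTitle"]/a/@href').extract(),
            }
        next_link = response.css("a.woPagingNext")
        if next_link:
            # response.follow accepts the <a> selector directly and builds an absolute URL
            yield response.follow(next_link[0], callback=self.parse_listing_page)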

In settings.py, add the following:

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
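If you prefer not to edit the project-wide settings.py, the same two overrides can be scoped to this one spider through Scrapy's custom_settings class attribute; a minimal sketch under that assumption (my addition, not part of the original answer):

from scrapy.contrib.spiders import CrawlSpider

class WiseowlspSpider(CrawlSpider):
    name = "wiseowlsp"

    # Per-spider equivalent of the settings.py snippet above
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
        },
    }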

Now you can see that my code reads easily from top to bottom and you can follow its logic.
