How to crawl URLs with Scrapy

Posted 2024-10-06 07:12:35


I want to crawl the site https://www.aparat.com/.

I can crawl it correctly and get all of the video links together with their title tags, like this:

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'aparatspider'
    start_urls = ['https://www.aparat.com/']

    def parse(self, response):
        print('=' * 80, 'latest-trend :')
        # Select the "latest trend" grid and iterate over its list items
        ul5 = response.css('.block-grid.xsmall-block-grid-2.small-block-grid-3.medium-block-grid-4.large-block-grid-5.is-not-center')
        ul5 = ul5.css('ul').css('li')
        latesttrend = []
        for li5 in ul5:
            # The video link is stored in the anchor's onmousedown attribute
            latesttrend.append(li5.xpath('div/div[1]/a').xpath('@onmousedown').extract_first())
            print(latesttrend)

Now my question is:

How can I get more than 1000 links from the داغ ترین ها (hottest) tab? At the moment I only get around 60.


1 answer
User
#1 · Posted 2024-10-06 07:12:35

I do this with the following code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class aparat_hotnewsItem(scrapy.Item):
    videourl = scrapy.Field()


class aparat_hotnewsSpider(CrawlSpider):
    name = 'aparat_hotnews'
    allowed_domains = ['www.aparat.com']
    start_urls = ['http://www.aparat.com/']

    # XPath for selecting links to follow
    xp = 'your xpath'

    rules = (
        Rule(LinkExtractor(restrict_xpaths=xp), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = aparat_hotnewsItem()
        item['videourl'] = response.xpath('your xpath').extract()
        yield item
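
The 'your xpath' placeholders are what matter here: the first one has to point at the listing's pager so the CrawlSpider keeps following "next page" links instead of stopping at the first page (which is why a plain spider only sees roughly 60 items), and the second one at the video links inside the grid. Below is a minimal sketch of how that might look; the URLs, class names and XPath expressions are assumptions about aparat.com's markup and would need to be checked against the live page:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AparatHotItem(scrapy.Item):
    videourl = scrapy.Field()


class AparatHotSpider(CrawlSpider):
    name = 'aparat_hot'
    allowed_domains = ['www.aparat.com']
    # Assumed entry point for the داغ ترین ها listing; replace with the real URL.
    start_urls = ['https://www.aparat.com/']

    rules = (
        # Assumed XPath for the pager: following it makes Scrapy request every
        # page of the listing, not just the first one.
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "pagination")]'),
             follow=True),
        # Assumed XPath for the video grid: each matched link is a video page
        # that gets handed to parse_item.
        Rule(LinkExtractor(restrict_xpaths='//ul[contains(@class, "block-grid")]'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = AparatHotItem()
        item['videourl'] = response.url
        yield item

You can run it with scrapy crawl aparat_hot -o videos.json; Scrapy's built-in duplicate filter drops requests it has already seen, so following the pager from every page is safe.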
