Web Scraping: items come back empty/NA/Null when running the spider, but the same selectors return correct values in scrapy shell

Posted 2024-06-26 13:47:25


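For an adult site, I wrote a spider that crawls through the paginated list of newest videos and scrapes the metadata of the 32 videos on each page.

Here is my spider code: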

import scrapy


class NaughtySpider(scrapy.Spider):
    name = "naughtyspider"
    allowed_domains = ["pornhub.com"]
    max_pages = 3

    # Start requests: one request per listing page
    def start_requests(self):
        # range() excludes its end value, so add 1 to include page max_pages
        for i in range(1, self.max_pages + 1):
            yield scrapy.Request('https://www.pornhub.com/video?o=cm&page=%s' % i, callback=self.parse_video)

    # First parsing method: collect the video links from a listing page
    def parse_video(self, response):
        self.log('F i n i s h e d  s c r a p i n g ' + response.url)
        # Correct path: selects the 32 videos on the page, ignoring links that come from ads
        video_links = response.css('ul#videoCategory').css('li.videoBox').css('div.thumbnail-info-wrapper').css('span.title > a').css('::attr(href)')
        links_to_follow = video_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_metadata)

    # Second parsing method: extract the metadata from a video page
    def parse_metadata(self, response):
        # Create a SelectorList of the video title text
        video_title = response.css('div.title-container > h1.title > span.inlineFree::text')
        # Extract the first matching title (None if nothing matched)
        video_title_ext = video_title.extract_first()
        # Extract views
        video_views = response.css('span.count::text').extract_first()
        # Extract tags
        video_tags = response.css('div.tagsWrapper a::text').extract()
        # Extract categories
        video_categories = response.css('div.categoriesWrapper a::text').extract()
        # Fill in the dictionary
        yield {
            'title': video_title_ext,
            'views': video_views,
            'tags': video_tags,
            'categories': video_categories,
        }

The problem is that almost half of the items end up empty, with no title, views, tags, or categories. Example from the log:

[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.pornhub.com/view_video.php?viewkey=ph5d594b093f8d6>
{'title': None, 'views': None, 'tags': [], 'categories': []}
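
To see how often this happens and which URLs are affected, each item can be checked before it is yielded. The following is only a sketch: `is_empty_item` is a hypothetical helper, not part of Scrapy, and it assumes that `None`, empty strings, and empty lists all count as missing.

```python
def is_empty_item(item):
    """Return True when every field of a scraped item is missing.

    Hypothetical helper (not a Scrapy API): None, '' and [] count as missing.
    """
    return all(value in (None, "", []) for value in item.values())


# The item from the log above, and a partially filled one for comparison
empty = {'title': None, 'views': None, 'tags': [], 'categories': []}
partial = {'title': None, 'views': '6', 'tags': ['alday'], 'categories': []}

print(is_empty_item(empty))    # True
print(is_empty_item(partial))  # False
```

Inside `parse_metadata`, a check like this could be used to log `response.url` and save `response.body` for the empty cases, so the HTML the spider received can be compared with what scrapy shell receives.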

Yet if I fetch exactly the same link in scrapy shell and copy and paste exactly the same selector paths from the spider, I get the correct values:

In [4]: fetch('https://www.pornhub.com/view_video.php?viewkey=ph5d594b093f8d6')
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.pornhub.com/view_video.php?viewkey=ph5d594b093f8d6> (referer: None)

In [5]: response.css('div.tagsWrapper a::text').extract()
Out[5]: ['alday', '559', '+ ']

In [6]: response.css('span.count::text').extract_first()
Out[6]: '6'

Thanks in advance for your help.

Edit: Am I right in thinking that this is not a problem with my code but a restriction on the server side to prevent scraping?
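
If the server does limit rapid requests, one thing to try is slowing the crawl down and checking whether the share of empty items drops. The fragment below is a sketch, not a confirmed fix: the keys are standard Scrapy settings, but the values are guesses and the name `POLITE_SETTINGS` is made up for illustration.

```python
# Hypothetical values; tune them for the target site.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,                  # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # let Scrapy adapt the delay to server latency
}
```

In the spider this would be applied by adding `custom_settings = POLITE_SETTINGS` to the class body.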


1 Answer
User
#1 · Posted 2024-06-26 13:47:25

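Data such as views and duration seems to be held in HTML variable elements, <var> DATA </var>. For example, if you enter the following line in scrapy shell, you should get the matching <var> element (the class used here is duration).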

response.xpath(".//var[@class='duration']")

Not sure whether it works, but it is worth a try.
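
To illustrate what such an expression matches, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`, which supports this subset of XPath. The HTML fragment is made up for the example (including the `views` class), so the real page markup may differ. Note that the attribute predicate must close with `]`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment; the real page markup may differ.
SAMPLE = """
<div class="video-info">
  <var class="duration">12:34</var>
  <var class="views">6</var>
</div>
"""

root = ET.fromstring(SAMPLE)

# Select only the <var> elements whose class attribute is exactly 'duration'
durations = [el.text for el in root.findall(".//var[@class='duration']")]
print(durations)  # ['12:34']
```

In scrapy shell, the equivalent would be `response.xpath(".//var[@class='duration']/text()").extract_first()`, assuming the page really contains such an element.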

By the way, I will have to tell my wife this was for educational purposes.
