How do I scrape a web page with the Scrapy framework?

Posted on 2024-06-30 16:37:48


I am new to web scraping. I have started learning the Scrapy framework.

I have gone through the basic Scrapy tutorial. Now I am trying to scrape this

According to this tutorial, to fetch the entire HTML page you should write the following code:

import scrapy


class ClothesSpider(scrapy.Spider):
    name = "clothes"

    start_urls = [
        'https://www.chumbak.com/women-apparel/GY1/c/',
    ]

    def parse(self, response):
        filename = 'clothes.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

This code runs fine, but I am not getting the expected result.

When I open clothes.html, the HTML is different from what I see when I inspect the page in the browser. A lot of content is missing from clothes.html.

I don't understand what is going wrong here. Please help me move forward; any help would be appreciated.

Thanks


Tags: data, web, framework, response, html, tutorial, filename
1 Answer
User
#1 · Posted on 2024-06-30 16:37:48

This page uses JavaScript to put the data on the page.

Using DevTools in Chrome/Firefox you can see which URLs JavaScript uses to fetch this data from the server (tab: Network, filter: XHR).

Then you can try to fetch the same data yourself.
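
For example, here is a quick standalone check (a minimal sketch, assuming the requests library is installed) that the XHR endpoint found in DevTools really returns the product data as JSON; the products, title and image_url fields are the same ones the spider below relies on:

import requests

# the JSON endpoint discovered in the DevTools Network tab (filter XHR)
url = 'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=1'

response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = response.json()

# print a few fields to confirm the structure before writing the spider
for product in data['products']:
    print(product['title'], '->', product['image_url'])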

The code below uses that JSON endpoint to generate URLs for 10 pages and download them, saving each page in a separate file. It builds the full image URLs and then downloads the images into the subfolder full. Scrapy also saves all the data about the downloaded images in output.json.

#!/usr/bin/env python3

import scrapy
#from scrapy.commands.view import open_in_browser
import json

class MySpider(scrapy.Spider):

    name = 'myspider'

    #allowed_domains = []

    #start_urls = ['https://www.chumbak.com/women-apparel/GY1/c/']

    #start_urls = [
    #    'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=1',
    #    'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=2',
    #    'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=3',
    #]

    def start_requests(self):
        pages = 10
        url_template = 'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page={}'

        for page in range(1, pages+1):
            url = url_template.format(page)
            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)

        #open_in_browser(response)

        # get page number
        page_number = response.url.split('=')[-1]

        # save JSON in separated file
        filename = 'page-{}.json'.format(page_number)
        with open(filename, 'wb') as f:
            f.write(response.body)

        # convert JSON into Python's dictionary
        data = json.loads(response.text)

        # get urls for images
        for product in data['products']:
            #print('title:', product['title'])
            #print('url:', product['url'])
            #print('image_url:', product['image_url'])

            # create full url to image
            image_url = 'https://media.chumbak.com/media/catalog/product/small_image/260x455' + product['image_url']
            # send it to scrapy and it will download it
            yield {'image_urls': [image_url]}


        # download files
        #for src in response.css('img::attr(src)').extract():
        #   url = response.urljoin(src)
        #   yield {'file_urls': [url]}

        # download images and convert to JPG
        #for src in response.css('img::attr(src)').extract():
        #   url = response.urljoin(src)
        #   yield {'image_urls': [url]}

#  - it runs without a Scrapy project and saves the scraped items in `output.json`  -

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in CSV or JSON
    'FEED_FORMAT': 'json',     # 'csv', 'json', 'xml'
    'FEED_URI': 'output.json', # 'output.csv', 'output.json', 'output.xml'

    # download files to `FILES_STORE/full`
    # it needs `yield {'file_urls': [url]}` in `parse()`
    #'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    #'FILES_STORE': '/path/to/valid/dir',

    # download images and convert to JPG
    # it needs `yield {'image_urls': [url]}` in `parse()`
    #'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    #'IMAGES_STORE': '/path/to/valid/dir',
    'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    'IMAGES_STORE': '.',
})
c.crawl(MySpider)
c.start()
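
This script runs as-is (for example: python myspider.py); CrawlerProcess starts the crawl by itself, so no Scrapy project and no scrapy crawl command are needed. With the settings above, the per-page JSON files are saved next to the script, the images end up in the subfolder full/ (ImagesPipeline always stores under <IMAGES_STORE>/full), and the image metadata is written to output.json.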
