Setting up rules with Scrapy's CrawlSpider

class Test_Spider(CrawlSpider):
    name = "test"
    allowed_domains = ['http://www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow='.jpg'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print(response.url)

1 answer

User

#1 · Posted on 2024-10-02 08:28:32

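First, the purpose of rules is not only to extract links but, most importantly, to follow them. If you only want to extract links (for example, to save them for later use), you don't need to specify spider rules. If, on the other hand, you want to download the images, use a pipeline.

That said, the reason the spider does not follow the links is hidden in the implementation of LinkExtractor: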

# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',

    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg',
    'odp',

    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'zip', 'rar',
]
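
To make the effect of that list concrete, here is a simplified sketch of the kind of extension check it implies. This is not Scrapy's actual code: the function name `has_ignored_extension`, the shortened `IGNORED` set, and the example URLs are illustrative assumptions.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical subset of the list above, just for illustration.
IGNORED = {"jpg", "jpeg", "png", "gif", "pdf", "zip"}

def has_ignored_extension(url, ignored=IGNORED):
    """Return True if the URL path ends in one of the ignored extensions."""
    ext = splitext(urlparse(url).path)[1].lower()  # e.g. ".jpg" or ""
    return ext.lstrip(".") in ignored

print(has_ignored_extension("http://example.com/uploads/photo.jpg"))
print(has_ignored_extension("http://example.com/uploads/index.html"))
```

A link for which such a check is true is dropped before any `allow` pattern gets a chance to match, which is why `allow='.jpg'` yields nothing. LinkExtractor also takes a `deny_extensions` argument that replaces this default list, so passing a custom list there should, as far as I know, let image links through.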

Edit:

To download the images in this example using ImagesPipeline:

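In the project settings, add ImagesPipeline to ITEM_PIPELINES and set IMAGES_STORE to the directory where the downloaded images should be stored.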
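
A minimal sketch of a settings fragment that enables ImagesPipeline, assuming the pipeline path used by current Scrapy releases. The priority value `1` and the `downloaded_images` directory are placeholders chosen for illustration, not values from the original answer.

```python
# settings.py (fragment)

# Enable the built-in images pipeline; the number is its order among pipelines.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,  # placeholder priority
}

# Directory where downloaded images are written (placeholder path).
IMAGES_STORE = "downloaded_images"
```

ImagesPipeline depends on the Pillow library, so that needs to be installed as well.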

Create a new item:

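from scrapy import Item, Field

class MyImageItem(Item):
    images = Field()      # filled in by ImagesPipeline with the download results
    image_urls = Field()  # URLs that ImagesPipeline should download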

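Modify your spider (add a parse method; note that overriding parse in a CrawlSpider disables its rules, which are not needed here):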

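    # Needs `from scrapy.loader import ItemLoader` at the top of the spider module.
    def parse(self, response):
        loader = ItemLoader(item=MyImageItem(), response=response)
        # Select the hrefs whose last four characters are ".jpg".
        img_paths = response.xpath('//a[substring(@href, string-length(@href)-3)=".jpg"]/@href').extract()
        # Turn the (possibly relative) hrefs into absolute URLs.
        loader.add_value('image_urls', [response.urljoin(img_path) for img_path in img_paths])
        return loader.load_item()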

The XPath expression selects every href that ends in ".jpg", and the extract() method returns the matches as a list.
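
The same filtering and URL-building logic can be sketched in plain Python, which may make the XPath easier to read: `substring(@href, string-length(@href)-3)` takes the last four characters of the href, just like `h[-4:]` below. The helper name `jpg_links` and the sample hrefs are made up for illustration, and `urljoin` from the standard library stands in for the URL-building step.

```python
from urllib.parse import urljoin

def jpg_links(base_url, hrefs):
    """Keep hrefs whose last four characters are '.jpg' and make them absolute."""
    return [urljoin(base_url, h) for h in hrefs if h[-4:] == ".jpg"]

base = "http://www.dragonflieswellness.com/wp-content/uploads/2015/09/"
# Hypothetical hrefs, as they might appear in a directory listing.
sample_hrefs = ["a.jpg", "notes.txt", "?C=N;O=D", "b.jpg"]

for url in jpg_links(base, sample_hrefs):
    print(url)
```

Note that the comparison is case-sensitive, so a link ending in ".JPG" would be skipped by both this sketch and the XPath.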

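The loader is an extra convenience that simplifies creating the item, but you can do without it.

Note that I'm not an expert, and there may be a better, more elegant solution. Still, this one works fine.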
