Scrapy spider gets stuck after crawling and scraping requests

Posted 2024-09-28 21:37:16


I am trying to scrape MichaelKors.com. My spider crawls and scrapes 572 items correctly, but then it gets stuck on a single request. The log is as follows:

2019-07-18 04:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/painterly-reef-print-crepe-ruffled-skirt/_/R-US_MU97ETPBPL> (referer: https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en)
2019-07-18 04:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/rainbow-stretch-viscose-pencil-skirt/_/R-US_MU97EYUBZV> (referer: https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en)
2019-07-18 04:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/rainbow-logo-striped-georgette-skirt/_/R-US_MU97EZ0BZL> (referer: https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en)
2019-07-18 04:24:29 [scrapy.extensions.logstats] INFO: Crawled 664 pages (at 11 pages/min), scraped 575 items (at 14 items/min)
2019-07-18 04:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/striped-stretch-cotton-pencil-skirt/_/R-US_MU97EY1BVG> (referer: https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en)
2019-07-18 04:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/butterfly-print-crepe-wrap-skirt/_/R-US_MS97EX3AXN> (referer: https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en)
2019-07-18 04:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/medallion-lace-skirt/_/R-US_MU97EZ9BXW> (referer: https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en)
2019-07-18 04:24:29 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): michaelkors.scene7.com:443

My scraper code is as follows:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MichaelKorsClass(CrawlSpider):
    name = 'michaelkors'
    allowed_domains = ['www.michaelkors.com']
    start_urls = ['https://www.michaelkors.com/women/clothing/dresses/_/N-28ei']
    rules = (
        # Rule(LinkExtractor(allow=(r'(.*\/_\/R-\w\w_)([\-a-zA-Z0-9]*)$',),
        #                    deny=(r'((.*investors.*)|(/info/)|(contact\-us)|(checkout))',)),
        #      callback='parse_product'),
        Rule(LinkExtractor(allow=(r'(.*\/_\/)(N-[\-a-zA-Z0-9]*)$',),
                           deny=(r'((.*investors.*)|(/info/)|(contact\-us)|(checkout)|(gifts))',)),
             callback='parse_list'),
    )

    def parse_product(self, response):
        ...

    def parse_list(self, response):
        url = response.url

        is_listing_page = False
        # Listing pages expose a product count, e.g. "575 Products"
        product_count = response.xpath('//span[@class="product-count"]/text()').get()

        # If the count cannot be parsed as an integer, this is not a listing page
        try:
            product_count = int(product_count)
            is_listing_page = True
        except (TypeError, ValueError):
            is_listing_page = False

        if is_listing_page:
            for product_url in response.xpath('//ul[@class="product-wrapper product-wrapper-four-tile"]//li[@class="product-name-container"]/a/@href').getall():
                yield scrapy.Request(response.urljoin(product_url), callback=self.parse_product)
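For reference, the `allow` pattern in the active rule is meant to match listing URLs ending in an `/N-…` segment (like the referer pages in the log), while the commented-out rule's pattern matches product URLs ending in `/R-US_…`. A quick standalone sanity check of those two regexes against URLs taken from the log above (a sketch, not part of the spider):

```python
import re

# Patterns copied from the spider's rules
listing_re = r'(.*\/_\/)(N-[\-a-zA-Z0-9]*)$'
product_re = r'(.*\/_\/R-\w\w_)([\-a-zA-Z0-9]*)$'

# URLs taken from the crawl log
listing_url = 'https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en'
product_url = 'https://www.michaelkors.com/medallion-lace-skirt/_/R-US_MU97EZ9BXW'

print(bool(re.search(listing_re, listing_url)))  # True: listing pattern matches listing URL
print(bool(re.search(product_re, product_url)))  # True: product pattern matches product URL
print(bool(re.search(product_re, listing_url)))  # False: product pattern rejects listing URL
```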

parse_list() checks how many products are listed and recursively crawls the site by issuing a request for each product, while parse_product does the further processing, such as downloading. My code runs fine, but it gets stuck at the point shown in the log above. When it does not get stuck, it opens an HTTPS connection and requests the image URL, like this:
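The commented-out `re.findall('\d+', pc)` line hints at the intended way to extract the count: pull the first integer out of the product-count label. A minimal sketch of that helper (the label text "575 Products" is an assumption about the page, and `extract_count` is a hypothetical name):

```python
import re

def extract_count(text):
    """Return the first integer found in a product-count label, or None."""
    if text is None:
        return None
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print(extract_count('575 Products'))  # 575
print(extract_count(None))            # None
```

Parsing the count this way would also avoid relying on `int()` raising on non-numeric label text.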

2019-07-18 04:23:57 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): michaelkors.scene7.com:443
2019-07-18 04:24:00 [urllib3.connectionpool] DEBUG: https://michaelkors.scene7.com:443 "GET /is/image/MichaelKors/MH73E94C64-0100_2 HTTP/1.1" 200 7267

I hope I have explained my problem correctly. If not, please let me know what I should add to or remove from the code.
