指向本地缓存而不是执行正常的spidering进程

1条回答

网友

1楼 · 发布于 2024-09-17 19:44:40

我认为custom Downloader Middleware是个好方法。其思想是让这个中间件直接从数据库返回源代码，而不让Scrapy发出任何HTTP请求。在

示例实现（未经测试，肯定需要错误处理）：

import re

import MySQLdb
from scrapy.http import Response    
from scrapy.exceptions import IgnoreRequest
from scrapy import log
from scrapy.conf import settings


class CustomDownloaderMiddleware(object):
   def __init__(self, *args, **kwargs):
        super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)

        self.connection = MySQLdb.connect(**settings.DATABASE)
        self.cursor = self.connection.cursor()

   def process_request(self, request, spider):
       # extracting product id from a url
       product_id = re.search(request.url, r"(\d+)$").group(1)

       # getting cached source code from the database by product id
       self.cursor.execute("""
           SELECT
               source_code
           FROM
               products
           WHERE
               product_id = %s
       """, product_id)

       source_code = self.cursor.fetchone()[0]

       # making HTTP response instance without actually hitting the web-site
       return Response(url=request.url, body=source_code)

别忘了activate the middleware。在

相关问题更多 >

编程相关推荐

热门问题

热门文章

指向本地缓存而不是执行正常的spidering进程

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >