How do I match items with records from the database?

Published 2024-09-07 12:29:01


I store URLs in a database table:

 scrapy_id  | scrapy_name   |    url    
------------+---------------+-----------------
        111 |       aaa     |  http://url1.com   
        222 |       bbb     |  http://url2.com 
        333 |       ccc     |  http://url3.com   

I need to start requests from these URLs, so I initialize the database connection in the pipeline's open_spider:

import psycopg2

class PgsqlPipeline(object):

...

    def open_spider(self, spider):
        # Open the connection when the spider starts, and hand the spider
        # a reference to this pipeline so it can call get_urls()
        self.conn = psycopg2.connect(database=self.XXX, user=self.XXX, password=self.XXX)
        self.cur = self.conn.cursor()
        spider.myPipeline = self

    def get_urls(self):
        get_urls_sql = """
        SOME_SQL_STATEMENTS
        """

        self.cur.execute(get_urls_sql)
        rows = self.cur.fetchall()
        return rows

...
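As an aside, `cursor.fetchall()` returns a list of tuples in SELECT column order, which is what the spider later indexes with `row[0]`, `row[1]`, `row[2]`. A self-contained illustration using the stdlib `sqlite3` module (same DB-API shape as psycopg2; the table and SQL here are stand-ins for the real ones):

```python
import sqlite3

# In-memory stand-in for the real PostgreSQL table
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE scrapy (scrapy_id INT, scrapy_name TEXT, url TEXT)')
cur.executemany('INSERT INTO scrapy VALUES (?, ?, ?)', [
    (111, 'aaa', 'http://url1.com'),
    (222, 'bbb', 'http://url2.com'),
    (333, 'ccc', 'http://url3.com'),
])

# fetchall() yields tuples in column order: row[0]=id, row[1]=name, row[2]=url
cur.execute('SELECT scrapy_id, scrapy_name, url FROM scrapy')
rows = cur.fetchall()
```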

Then, in the spider:

....

class SephoraSpider(Spider):
    name = 'XXX'
    allowed_domains = ['XXX']

    def start_requests(self):
        for row in self.myPipeline.get_urls():
            self.item = SomeItem()
            url = str(row[2])
            self.item['id'] = row[0]
            self.item['name'] = row[1]
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        self.item['text'] = response.xpath('XXXX').get()
        return self.item

....

And in the item definition:

....

class SomeItem(Item):
    id = Field()
    name = Field()
    text = Field()
....

I expect to get the following items:

{
    "id": 111,
    "name": "aaa",
    "text": response1,
},
{
    "id": 222,
    "name": "bbb",
    "text": response2,
},
{
    "id": 333,
    "name": "ccc",
    "text": response3,
}

But instead I get:

{
    "id": 333,
    "name": "ccc",
    "text": response1,
},
{
    "id": 333,
    "name": "ccc",
    "text": response2,
},
{
    "id": 333,
    "name": "ccc",
    "text": response3,
}

The problem is probably that I put self.item = SomeItem() in start_requests(). But if I move self.item = SomeItem() into parse_item(), I can't access id and name there, which means I can't match a parsed response to its ID.
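That diagnosis is right: Scrapy schedules all the requests before the callbacks start running, so a single instance attribute is overwritten before any callback reads it. A plain-Python sketch of the race (no Scrapy needed; deferred callbacks stand in for scheduled requests):

```python
class SharedItemSpider:
    """Mimics the buggy pattern: self.item is one shared attribute."""

    def start_requests(self):
        callbacks = []
        for row in [(111, 'aaa'), (222, 'bbb'), (333, 'ccc')]:
            self.item = {'id': row[0], 'name': row[1]}  # overwritten each loop
            callbacks.append(self.parse_item)           # runs later, like Scrapy
        return callbacks

    def parse_item(self):
        return dict(self.item)  # reads whatever self.item holds *now*

spider = SharedItemSpider()
results = [cb() for cb in spider.start_requests()]
# all three results carry id 333 / name 'ccc' -- the last row wins
```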

How do I match each item with its record in the database?


1 Answer

Answered 2024-09-07 12:29:01

You can't use self to store request metadata, because you only set it when starting the requests; the data needs to travel with each request, not live on the SephoraSpider instance. By the time a parse_item callback runs, self.item holds the values from the last request that was started. Use the Request.meta field instead:

class SephoraSpider(Spider):
    name = 'XXX'
    allowed_domains = ['XXX']

    def start_requests(self):
        for row in self.myPipeline.get_urls():
            url = str(row[2])
            item = {'id': row[0], 'name': row[1], 'url': row[2]}
            yield Request(url, callback=self.parse_item, meta={'item': item})

    def parse_item(self, response):
        item = response.meta['item']
        item['text'] = response.xpath('XXXX').get()
        return item

See the docs for details.
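The same deferred-callback sketch shows why meta fixes it: the dict is built locally on each iteration and bound to that particular request, so every callback sees its own copy (plain Python, no Scrapy required; names are illustrative):

```python
def parse_item(meta):
    item = meta['item']
    item['text'] = 'response for ' + item['name']  # stands in for the xpath
    return item

def start_requests():
    callbacks = []
    for row in [(111, 'aaa'), (222, 'bbb'), (333, 'ccc')]:
        meta = {'item': {'id': row[0], 'name': row[1]}}  # per-request dict
        # The default argument freezes this request's meta, the way Scrapy
        # carries meta from the Request through to its Response
        callbacks.append(lambda m=meta: parse_item(m))
    return callbacks

items = [cb() for cb in start_requests()]
# ids come back as 111, 222, 333 -- each response matched to its record
```

In Scrapy 1.7 and later, `Request(..., cb_kwargs={...})` is also available and is the recommended way to pass user data, delivering it as keyword arguments to the callback instead of through `response.meta`.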
