I store the URLs in a database table:
scrapy_id | scrapy_name | url
----------+-------------+----------------
      111 | aaa         | http://url1.com
      222 | bbb         | http://url2.com
      333 | ccc         | http://url3.com
I need to start requests from these URLs, so I initialize the database connection in the pipeline's open_spider():
import psycopg2

class PgsqlPipeline(object):
    ...
    def open_spider(self, spider):
        self.conn = psycopg2.connect(database=self.XXX, user=self.XXX, password=self.XXX)
        self.cur = self.conn.cursor()
        spider.myPipeline = self

    def get_urls(self):
        get_urls_sql = """
        SOME_SQL_STATEMENTS
        """
        self.cur.execute(get_urls_sql)
        rows = self.cur.fetchall()
        return rows
    ...
Then, in the spider:
....
class SephoraSpider(Spider):
    name = 'XXX'
    allowed_domains = ['XXX']

    def start_requests(self):
        for row in self.myPipeline.get_urls():
            self.item = SomeItem()
            url = str(row[2])
            self.item['id'] = row[0]
            self.item['name'] = row[1]
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        self.item['text'] = response.xpath('XXXX').get()
        return self.item
....
And in the items file:
....
class SomeItem(Item):
    id = Field()
    name = Field()
    text = Field()
....
I want to get the following items:
{
    "id": 111,
    "name": "aaa",
    "text": response1,
},
{
    "id": 222,
    "name": "bbb",
    "text": response2,
},
{
    "id": 333,
    "name": "ccc",
    "text": response3,
}
But instead I get:
{
    "id": 333,
    "name": "ccc",
    "text": response1,
},
{
    "id": 333,
    "name": "ccc",
    "text": response2,
},
{
    "id": 333,
    "name": "ccc",
    "text": response3,
}
The problem is probably that I put self.item = SomeItem() in start_requests(). But if I move self.item = SomeItem() into parse_item(), I can no longer access id and name there, which means I cannot match each parsed response to its ID.
How can I match the items to their records in the database?
You cannot use self to store per-request metadata: you set it only when starting the requests, so by the time each parse_item callback runs, self.item holds the values from the last request that was started. You need to persist the data with the request itself, not on the SephoraSpider class instance. Instead, use the Request.meta field; see the docs for details.