Scrapy pipeline that updates MySQL for each start URL

import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request
from youtubephase2.items import Youtubephase2Item


class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT resultURL FROM SearchResults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                yield Request(row[0], self.parse)
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['pageurl'] = sel.xpath('@href').extract()
            yield item

1 Answer

User

#1 · Posted on 2024-09-30 14:25:32

You can use the meta argument of Request to pass related information between a request and the items scraped from its response:

def start_requests(self):
    conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute('SELECT resultURL FROM SearchResults;')
    rows = cursor.fetchall()

    for row in rows:
        if row:
            # Attach the database URL to the request so parse() can read it back.
            yield Request(row[0], self.parse, meta=dict(start_url=row[0]))
    # Reached once every request has been yielded.
    cursor.close()
    conn.close()

def parse(self, response):
    for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
        item = Youtubephase2Item()
        item['pageurl'] = sel.xpath('@href').extract()
        # The URL exactly as it was read from SearchResults.
        item['start_url'] = response.meta['start_url']
        yield item

You could also use response.url, but it can change because of redirects or for other reasons, so it may end up differing from the URL stored in the database.

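Finally, you have to update your pipeline to pass item['start_url'] as the start_url parameter in cursor.execute. Note that Youtubephase2Item must also declare start_url as a Field, otherwise assigning item['start_url'] raises a KeyError.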
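The question does not show the pipeline, so the following is only a minimal sketch of what that last step could look like. The class name MySQLUpdatePipeline and the pageURL column are hypothetical; only the SearchResults table, the resultURL column and the connection settings come from the code above. MySQLdb is imported inside open_spider so the class itself can be loaded without the driver installed.

```python
class MySQLUpdatePipeline:
    """Hypothetical pipeline: writes a scraped link back to the row it came from."""

    # Assumed schema: SearchResults(resultURL, pageURL). pageURL is a made-up
    # column used for illustration; substitute whatever column you update.
    UPDATE_SQL = "UPDATE SearchResults SET pageURL = %s WHERE resultURL = %s"

    def open_spider(self, spider):
        # Imported lazily; same connection settings as in the spider.
        import MySQLdb
        self.conn = MySQLdb.connect(user='uname', passwd='password',
                                    db='YouTubeScrape', host='localhost',
                                    charset='utf8', use_unicode=True)

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # extract() returns a list of hrefs; keep the first one, if any.
        hrefs = item.get('pageurl') or []
        page_url = hrefs[0] if hrefs else None

        cursor = self.conn.cursor()
        try:
            # start_url identifies the row that produced this item.
            cursor.execute(self.UPDATE_SQL, (page_url, item['start_url']))
            self.conn.commit()
        finally:
            cursor.close()
        return item
```

To try it, the class would be registered under ITEM_PIPELINES in settings.py. Passing the values as a parameter tuple to cursor.execute, rather than formatting them into the SQL string, lets the driver handle quoting of the URLs.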
