PYTHON Scrapy | Inserting items into MySQL


I have been trying to build a news site that stores every article in a MySQL database. My goal is to store the following data for each article on the news site: date, title, summary, link.

I have tried different approaches, and after several weeks of attempts I decided to come here to Stack Overflow for a solution. (Note: I have code that comes close to solving my problem, but it pulls out all the items at once instead of one at a time, so I tried a new approach, and this is where I hit a wall.)

SPIDER.PY

    import scrapy
    from ..items import WebspiderItem


    class NewsSpider(scrapy.Spider):
        name = 'news'
        start_urls = [
            'https://www.coindesk.com/feed'
        ]

        def parse(self, response):
            for date in response.xpath('//pubDate/text()').extract():
                yield WebspiderItem(date=date)

            for title in response.xpath('//title/text()').extract():
                yield WebspiderItem(title=title)

            for summary in response.xpath('//description/text()').extract():
                yield WebspiderItem(summary=summary)

            for link in response.xpath('//link/text()').extract():
                yield WebspiderItem(link=link)

ITEMS.PY

    import scrapy


    class WebspiderItem(scrapy.Item):
        date = scrapy.Field()
        title = scrapy.Field()
        summary = scrapy.Field()
        link = scrapy.Field()

PIPELINES.PY

    import mysql.connector


    class WebspiderPipeline(object):

        def __init__(self):
            self.create_connection()
            self.create_table()

        def create_connection(self):
            self.conn = mysql.connector.connect(
                host='localhost',
                user='root',
                passwd='HIDDENPASSWORD',
                database='news_db'
            )
            self.curr = self.conn.cursor()

        def create_table(self):
            self.curr.execute("""DROP TABLE IF EXISTS news_tb""")
            self.curr.execute("""create table news_tb(
                            date text,
                            title text,
                            summary text,
                            link text
                            )""")

        def process_item(self, item, spider):
            self.store_db(item)
            return item

        def store_db(self, item):
            self.curr.execute("""insert into news_tb values (%s, %s, %s, %s)""", (
                item['date'],
                item['title'],
                item['summary'],
                item['link']
            ))
            self.conn.commit()
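
For the pipeline to run at all, Scrapy also has to be told about it in settings.py. A minimal sketch of that entry, assuming the project package is named webspider (inferred from the paths in the traceback below; adjust to your actual package name):

    # settings.py -- register the pipeline so Scrapy calls process_item()
    # for every yielded item; 300 is an arbitrary priority (lower runs first)
    ITEM_PIPELINES = {
        'webspider.pipelines.WebspiderPipeline': 300,
    }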

The response (I get several of these):

    2020-03-17 07:54:32 [scrapy.core.scraper] ERROR: Error processing {'link': 'https://www.coindesk.com/makerdaos-problems-are-a-textbook-case-of-governance-failure'}
    Traceback (most recent call last):
      File "c:\users\r\pycharmprojects\project\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "C:\Users\r\PycharmProjects\Project\webspider v3 RSS\webspider\pipelines.py", line 36, in process_item
        self.store_db(item)
      File "C:\Users\r\PycharmProjects\Project\webspider v3 RSS\webspider\pipelines.py", line 41, in store_db
        item['date'],
      File "c:\users\r\pycharmprojects\_project\venv\lib\site-packages\scrapy\item.py", line 91, in __getitem__
        return self._values[key]
    KeyError:

1 Answer

You should yield all the data at once rather than in separate loops. Python reads the code from top to bottom: you first yield an item containing only a date, the pipeline receives it and then tries to look up the values title, summary and link, which were never set, so it raises a KeyError. A corrected spider that fills all four fields before yielding follows the sketch below.
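
To see why the lookup fails, note that scrapy.Item is dict-like: a declared Field that was never assigned is simply absent from the item's values. A minimal repro sketch (the field values here are made up):

    import scrapy


    class WebspiderItem(scrapy.Item):
        date = scrapy.Field()
        title = scrapy.Field()

    item = WebspiderItem(date='2020-03-17')
    print(item['date'])   # prints '2020-03-17'
    print(item['title'])  # raises KeyError -- field declared but never assigned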

    import scrapy
    from ..items import WebspiderItem


    class NewsSpider(scrapy.Spider):
        name = 'news'

        def start_requests(self):
            page = 'https://www.coindesk.com/feed'
            yield scrapy.Request(url=page, callback=self.parse)

        def parse(self, response):
            # follow every article link found in the feed
            links = response.xpath('//link/text()').extract()
            for link in links:
                yield scrapy.Request(url=link, callback=self.parse_contents)

        def parse_contents(self, response):
            # build one item per article page, with all four fields set
            url = response.url
            article_title = response.xpath('//h1/text()').extract()[0]
            pub_date = response.xpath('//div[@class="article-hero-datetime"]/time/@datetime').extract()[0]
            description = response.xpath('//meta[@name="description"]/@content').extract()[0]
            item = WebspiderItem()
            item['date'] = pub_date
            item['title'] = article_title
            item['summary'] = description
            item['link'] = url
            yield item
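
Since the start URL is an RSS feed, another option is to skip the extra requests and treat each <item> node of the feed as one unit, so every yielded item already carries all four fields. A minimal sketch of such a parse method, assuming Scrapy parses the feed as XML and the feed follows the standard RSS layout (the node-relative XPaths are illustrative, not taken from the original code):

    def parse(self, response):
        # each <item> node bundles the fields of one article, so extracting
        # relative to the node keeps date, title, summary and link together
        for node in response.xpath('//item'):
            yield WebspiderItem(
                date=node.xpath('pubDate/text()').get(),
                title=node.xpath('title/text()').get(),
                summary=node.xpath('description/text()').get(),
                link=node.xpath('link/text()').get(),
            )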
