我一直在尝试创建一个新闻站点,将每篇文章存储在mySQL数据库中。我的目标是在新闻网站上存储每篇文章的以下数据:日期、标题、摘要、链接
我一直在尝试不同的方法,在尝试了几个星期后,我决定在stackoverflow上来这里寻求解决问题的方法。(注意:我有一个代码几乎可以解决我的问题,但它只一次取出所有项目,而不是一个接一个,因此我尝试了一种新的方法,这里是我碰壁的地方)
SPIDER.PY
import scrapy
from ..items import WebspiderItem
class NewsSpider(scrapy.Spider):
name = 'news'
start_urls = [
'https://www.coindesk.com/feed'
]
def parse(self, response):
for date in response.xpath('//pubDate/text()').extract():
yield WebspiderItem(date = date)
for title in response.xpath('//title/text()').extract():
yield WebspiderItem(title = title)
for summary in response.xpath('//description/text()').extract():
yield WebspiderItem(summary = summary)
for link in response.xpath('//link/text()').extract():
yield WebspiderItem(link = link)
ITEMS.PY
import scrapy
class WebspiderItem(scrapy.Item):
date = scrapy.Field()
title = scrapy.Field()
summary = scrapy.Field()
link = scrapy.Field()
管道。PY
import mysql.connector
class WebspiderPipeline(object):
def __init__(self):
self.create_connection()
self.create_table()
def create_connection(self):
self.conn = mysql.connector.connect(
host='localhost',
user='root',
passwd='HIDDENPASSWORD',
database='news_db'
)
self.curr = self.conn.cursor()
def create_table(self):
self.curr.execute("""DROP TABLE IF EXISTS news_tb""")
self.curr.execute("""create table news_tb(
date text,
title text,
summary text,
link text
)""")
def process_item(self, item, spider):
self.store_db(item)
return item
def store_db(self, item):
self.curr.execute("""insert into news_tb values (%s, %s, %s, %s)""", (
item['date'],
item['title'],
item['summary'],
item['link']
))
self.conn.commit()
响应 其中多个:
2020-03-17 07:54:32 [scrapy.core.scraper] ERROR: Error processing {'link': 'https://www.coindesk.com/makerdaos-problems-are-a-textbook-case-of-governance-failure'}
Traceback (most recent call last):
File "c:\users\r\pycharmprojects\project\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\r\PycharmProjects\Project\webspider v3 RSS\webspider\pipelines.py", line 36, in process_item
self.store_db(item)
File "C:\Users\r\PycharmProjects\Project\webspider v3 RSS\webspider\pipelines.py", line 41, in store_db
item['date'],
File "c:\users\r\pycharmprojects\_project\venv\lib\site-packages\scrapy\item.py", line 91, in __getitem__
return self._values[key]
KeyError:
您应该一次生成所有数据,不要在循环中这样做,python从上到下读取代码,您首先生成日期,管道接收到日期,然后尝试查找值title、summary和link,其缺失现在返回KeyError
相关问题 更多 >
编程相关推荐