Scrapy: scraping the next page after logging in


I'm very new to web scraping. I'm trying to scrape pages after successfully logging in to quotes.toscrape.com. My code (scrapytest/spiders/quotes_spider.py) looks like this:

import scrapy
from scrapy.http import FormRequest
from ..items import ScrapytestItem
from scrapy.utils.response import open_in_browser
from scrapy.spiders.init import InitSpider


class QuoteSpider(scrapy.Spider):
    name = 'scrapyquotes'
    login_url = 'http://quotes.toscrape.com/login'
    start_urls = [login_url]

    def parse(self,response):
        token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
        yield scrapy.FormRequest(url=self.login_url,formdata={
            'csrf_token':token,
            'username':'roberthng',
            'password':'dsadsadsa'
        },callback = self.start_scraping)

    def start_scraping(self,response):
        items = ScrapytestItem()
        all_div_quotes=response.css('div.quote')

        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items

        #Go to Next Page:     
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Whenever I run this code from the terminal (VS Code) with $ scrapy crawl scrapyquotes, it only logs in and scrapes the first page. It always fails to crawl the second page. Below is the error message that appears:

2020-10-10 12:26:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)

2020-10-10 12:26:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)

I suspect this has something to do with start_urls, but when I change it to 'http://quotes.toscrape.com/page/1' the code doesn't even scrape the first page. Can anyone help me fix this? Thanks in advance.

Full error log:

2020-10-10 12:26:40 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapytest)
2020-10-10 12:26:40 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0 
2020-10-10 12:26:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-10 12:26:40 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapytest',
 'NEWSPIDER_MODULE': 'scrapytest.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['scrapytest.spiders']}
2020-10-10 12:26:40 [scrapy.extensions.telnet] INFO: Telnet Password: 92d2fd08391e76a9
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-10 12:26:40 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapytest.pipelines.ScrapytestPipeline']
2020-10-10 12:26:40 [scrapy.core.engine] INFO: Spider opened
2020-10-10 12:26:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-10 12:26:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-10 12:26:41 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-10-10 12:26:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2020-10-10 12:26:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2020-10-10 12:26:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2020-10-10 12:26:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Albert Einstein'],
 'tag': ['change', 'deep-thoughts', 'thinking', 'world'],
 'title': ['“The world as we have created it is a process of our thinking. It '
           'cannot be changed without changing our thinking.”']}
2020-10-10 12:26:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['J.K. Rowling'],
 'tag': ['abilities', 'choices'],
 'title': ['“It is our choices, Harry, that show what we truly are, far more '
           'than our abilities.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Albert Einstein'],
 'tag': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
 'title': ['“There are only two ways to live your life. One is as though '
           'nothing is a miracle. The other is as though everything is a '
           'miracle.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Jane Austen'],
 'tag': ['aliteracy', 'books', 'classic', 'humor'],
 'title': ['“The person, be it gentleman or lady, who has not pleasure in a '
           'good novel, must be intolerably stupid.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Marilyn Monroe'],
 'tag': ['be-yourself', 'inspirational'],
 'title': ["“Imperfection is beauty, madness is genius and it's better to be "
           'absolutely ridiculous than absolutely boring.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Albert Einstein'],
 'tag': ['adulthood', 'success', 'value'],
 'title': ['“Try not to become a man of success. Rather become a man of '
           'value.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['André Gide'],
 'tag': ['life', 'love'],
 'title': ['“It is better to be hated for what you are than to be loved for '
           'what you are not.”']}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Thomas A. Edison'],
 'tag': ['edison', 'failure', 'inspirational', 'paraphrased'],
 'title': ["“I have not failed. I've just found 10,000 ways that won't work.”"]}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Eleanor Roosevelt'],
 'tag': ['misattributed-eleanor-roosevelt'],
 'title': ['“A woman is like a tea bag; you never know how strong it is until '
           "it's in hot water.”"]}
2020-10-10 12:26:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': ['Steve Martin'],
 'tag': ['humor', 'obvious', 'simile'],
 'title': ['“A day without sunshine is like, you know, night.”']}
2020-10-10 12:26:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
2020-10-10 12:26:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
Traceback (most recent call last):
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
    return next(self.data)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
    return next(self.data)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Robert\Documents\Demos\vstoolbox\scrapytest\scrapytest\spiders\quotes_spider.py", line 15, in parse
    yield scrapy.FormRequest(url=self.login_url,formdata={
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 104, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got NoneType
2020-10-10 12:26:42 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-10 12:26:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1832,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 8041,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 2.063919,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 10, 5, 26, 42, 486494),
 'item_scraped_count': 10,
 'log_count/DEBUG': 15,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 4,
 'robotstxt/request_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2020, 10, 10, 5, 26, 40, 422575)}
2020-10-10 12:26:42 [scrapy.core.engine] INFO: Spider closed (finished)

Code from the other files in the project:

(scrapytest/items.py)

import scrapy

class ScrapytestItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()

(scrapytest/pipelines.py)

from itemadapter import ItemAdapter
import sqlite3


class ScrapytestPipeline(object):        
    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect('myquotes.db')
        self.curr = self.conn.cursor()
    
    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS quotes_tb""")
        self.curr.execute("""create table quotes_tb(
                            title text,
                            author text,
                            tag text
                            )""")

    def process_item(self, item, spider):
        self.store_db(item)
        #print("Pipeline :" + item['title'][0])
        return item

    def store_db(self, item):
        self.curr.execute("""insert into quotes_tb values(?,?,?)""",(
            item['title'][0],
            item['author'][0],
            item['tag'][0]
        ))
        self.conn.commit()

(scrapytest/settings.py)

BOT_NAME = 'scrapytest'

SPIDER_MODULES = ['scrapytest.spiders']
NEWSPIDER_MODULE = 'scrapytest.spiders'
ITEM_PIPELINES = {
    'scrapytest.pipelines.ScrapytestPipeline': 300,
}

2 Answers

You're passing the wrong function as the callback. Your self.parse function should only be used for the login page:

if next_page is not None:
    yield response.follow(next_page, callback=self.start_scraping)
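
With that change, start_scraping loops back into itself for every quotes page, while parse only ever handles the login form. A minimal sketch of the corrected method (the extraction loop is abbreviated; only the callback differs from the original code):

def start_scraping(self, response):
    # ... same quote-extraction loop as in the original spider ...

    # Follow the pagination link back into this same method, so the
    # login-only parse() never runs against a quotes page.
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.start_scraping)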

This comes from your execution log:

  File "C:\Users\Robert\Documents\Demos\vstoolbox\scrapytest\scrapytest\spiders\quotes_spider.py", line 15, in parse
    yield scrapy.FormRequest(url=self.login_url,formdata={
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 104, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got NoneType

In short, it's telling you that one of the values in the formdata parameter is None, when it should be a "str or bytes object". Since formdata has three fields and only one of them is a variable, token must be coming back empty:

    ...
    token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
    yield scrapy.FormRequest(url=self.login_url,formdata={
        'csrf_token':token,
        'username':'roberthng',
        'password':'dsadsadsa'
    },callback = self.start_scraping)

However, if you are on the login page, the selector does return the value correctly. My assumption is that when you define the request for the next page, you are setting the callback to the parse method (or not setting it at all, in which case parse is the default). I say assumption because you didn't post that part of the code; your code sample stops here:

    #Go to Next Page:     
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:

So make sure that after this point you set the correct callback for the request.
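
For example, a defensive version of parse makes the failure obvious if it is ever called with the wrong page. This is only a sketch layered on the code above; the None check and the log message are additions, not part of the original spider:

def parse(self, response):
    # Only the login page contains this hidden input; on any other page
    # extract_first() returns None, and passing None to FormRequest is
    # exactly what raises the TypeError shown in the log.
    token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
    if token is None:
        self.logger.error('No csrf_token on %s - wrong callback?', response.url)
        return
    yield scrapy.FormRequest(
        url=self.login_url,
        formdata={
            'csrf_token': token,
            'username': 'roberthng',
            'password': 'dsadsadsa',
        },
        callback=self.start_scraping,
    )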
