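I have read a lot of posts about scraping, here and on other sites, and I still can't solve this problem, so I'm asking you :P I hope someone can help me.
I want to log in on the main customer page, then parse all the categories, then parse all the products, and save each product's title, category, quantity and price.
My code: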
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging


class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()


class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'),follow=True,callback='parse_items'),
    )

    def init_request(self):
        logging.info("You are in initRequest")
        return Request(url=self,callback=self.login)

    def login(self,response):
        logging.info("You are in login")
        return scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

    def check_login_response(self,response):
        logging.info("You are in checkLogin")
        if "Hola,XXXX" in response.body:
            self.log("Succesfully logged in.")
            return self.initialized()
        else:
            self.log("Something wrong in login.")

    def parse_items(self,response):
        logging.info("You are in item")
        item = scrapy.loader.ItemLoader(article(),response)
        item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
        yield item.load_item()
When I run scrapy crawl for this spider in the terminal, I get the following output:
(SCRAPY) pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders $ scrapy crawl combatzone_spider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders.init import InitSpider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
  from scrapy.contrib.spiders.init import InitSpider
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-i686-with-debian-9.5
2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7152,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
 'log_count/INFO': 7,
 'memusage/max': 36139008,
 'memusage/startup': 36139008,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)
The spider doesn't seem to be working. Any idea why this happens? Thanks a lot, everyone :D
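There are two problems.
First, the ? in the link patterns has to be escaped: /category.php?id=\d+ should be changed to /category.php\?id=\d+ (note the \?).
As for the login, I tried to make your code work, but I failed. I usually override start_requests to log in before crawling. The code is as follows: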
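On the regex point, a quick check with Python's re module shows why the unescaped pattern never matches a category link. The sample URL is made up for illustration. LinkExtractor treats each allow value as a regular expression, so the same reasoning should apply there.

```python
import re

# Hypothetical category URL, used only for this illustration.
url = "http://www.combatzone.es/category.php?id=12"

# Unescaped: "p?" means "an optional p", so the pattern expects
# "id=" directly after "ph" or "php" and never sees a literal "?".
unescaped = re.compile(r"/category.php?id=\d+")

# Escaped: "\?" matches the literal question mark in the query string.
escaped = re.compile(r"/category\.php\?id=\d+")

print(bool(unescaped.search(url)))  # False
print(bool(escaped.search(url)))    # True
```

The same escaping applies to the goods.php?id=\d+ pattern in the third rule.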