我浪费了好几天的时间去想些无聊的事情,看一些文档和其他一些无聊的博客和问答。。。现在我要做的是男人最讨厌的事情:问路;—)问题是:我的蜘蛛打开,获取起始网址,但显然什么也不做。相反,它立即关闭,就这样。很明显,我连第一个都不知道self.log日志()声明。在
到目前为止我得到的是:
# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *
class KiSpider(CrawlSpider):
name = "KiSpider"
allowed_domains = ['www.kiweb.de', 'kiweb.de']
start_urls = (
# ST Regra start page:
'https://www.kiweb.de/default.aspx?pageid=206',
# follow ST Regra links in the form of:
# https://www.kiweb.de/default.aspx?pageid=206&page=\d+
# https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
# ST Thermo start page:
'https://www.kiweb.de/default.aspx?pageid=202&page=1',
# follow ST Thermo links in the form of:
# https://www.kiweb.de/default.aspx?pageid=202&page=\d+
# https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
)
rules = (
# First rule that matches a given link is followed / parsed.
# Follow category pagination without further parsing:
Rule(
LinkExtractor(
# Extract links in the form:
allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
# but only within the pagination table cell:
restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
),
follow=True,
),
# Follow links to category (202|206) articles and parse them:
Rule(
LinkExtractor(
# Extract links in the form:
allow=r'Default\.aspx?pageid=299&docid=\d+',
# but only within article preview cells:
restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
),
# and parse the resulting pages for article content:
callback='parse_init',
follow=False,
),
)
# Once an article page is reached, check whether a login is necessary:
def parse_init(self, response):
self.log('Parsing article: %s' % response.url)
if not response.xpath('input[@value="Logout"]'):
# Note: response.xpath() is a shortcut of response.selector.xpath()
self.log('Not logged in. Logging in...\n')
return self.login(response)
else:
self.log('Already logged in. Continue crawling...\n')
return self.parse_item(response)
def login(self, response):
self.log("Trying to log in...\n")
self.username = self.settings['KI_USERNAME']
self.password = self.settings['KI_PASSWORD']
return FormRequest.from_response(
response,
formname='Form1',
formdata={
# needs name, not id attributes!
'ctl04$Header$ctl01$textbox_username': self.username,
'ctl04$Header$ctl01$textbox_password': self.password,
'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
'ctl04$Header$ctl01$checkbox_permanent': 'True',
},
callback = self.parse_item,
)
def parse_item(self, response):
articles = response.xpath('//div[@id="artikel"]')
items = []
for article in articles:
item = KiSpiderItem()
item['link'] = response.url
item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
item['article'] = articles.extract()
item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
item['lang'] = 'de-DE'
items.append(item)
# return(items)
yield items
# what is the difference between return and yield?? found both on web.
执行scrapy crawl KiSpider
时,会导致:
是不是登录例程不应该以回调结束,而是以某种return/yield语句结束?或者我做错了什么?不幸的是,到目前为止,我所看到的文档和教程只给了我一个模糊的概念,即每一个部分是如何相互联系的,尤其是Scrapy的文档似乎是作为对Scrapy有很多了解的人的参考。在
有点沮丧的问候 克里斯托弗
您不需要
allow
参数,因为XPath选择的标记中只有一个链接。在我不理解allow参数中的regex,但至少您应该转义
?
。相关问题 更多 >
编程相关推荐