Python/Scrapy: CrawlSpider stops after fetching the start URLs

Posted 2024-09-19 23:45:10


I have wasted several days now trying to figure this out, reading documentation and various blogs and Q&As... and now I am doing what men supposedly hate most: asking for directions ;-) The problem: my spider opens, fetches the start URLs and then apparently does nothing. Instead, it closes immediately, and that's it. Apparently I never even reach the first self.log() statement.

This is what I have so far:

# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *

class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
            # follow ST Regra links in the form of:
            # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
            # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
            # follow ST Thermo links in the form of:
            # https://www.kiweb.de/default.aspx?pageid=202&page=\d+ 
            # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)


    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback = self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
#       return(items)
        yield items
#       what is the difference between return and yield?? found both on web.

Running scrapy crawl KiSpider results in:

(Scrapy log output missing from the original post; as described above, the spider opens, fetches the start URLs and then closes immediately.)

Shouldn't the login routine end not with a callback, but with some kind of return/yield statement? Or am I doing something else wrong? Unfortunately, the documentation and tutorials I have seen so far only give me a vague idea of how all the pieces fit together; the Scrapy documentation in particular seems to be written as a reference for people who already know Scrapy quite well.
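On the return/yield question raised in the code comment above: a Scrapy callback may either return an iterable of items/requests or yield them one at a time, but yielding a whole list as a single object (as in `yield items`) is not what Scrapy expects. A minimal sketch of the more idiomatic per-item form, assuming the same spider class and the KiSpiderItem import shown above (it also uses the loop variable `article` rather than the full `articles` selector list, and escapes the dot and digit classes in the regexes):

    def parse_item(self, response):
        # Yield one item per matching <div id="artikel"> instead of
        # collecting everything in a list and yielding the list itself.
        for article in response.xpath('//div[@id="artikel"]'):
            item = KiSpiderItem()
            item['link'] = response.url
            # Use the loop variable "article", not the whole "articles" list:
            item['title'] = article.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = article.xpath("div[@class='ct2']/text()").extract()
            item['article'] = article.extract()
            item['published'] = article.xpath("div[@class='biblio']/text()").re(
                r"(\d{2}\.\d{2}\.\d{4}) PIE")
            item['artid'] = article.xpath("div[@class='biblio']/text()").re(
                r"PIE \[(\d+)-\d+\]")
            item['lang'] = 'de-DE'
            yield item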

Somewhat frustrated greetings, Christopher


1 Answer

Posted 2024-09-19 23:45:10
rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',

                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                # allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

You don't need the allow parameter at all, because the table cell selected by the XPath contains only one link.

I don't fully understand the regex in your allow parameter, but at the very least you should escape the ? (write \?): an unescaped ? is a regex quantifier that makes the preceding character optional, so the pattern never matches the literal ? that separates the path from the query string.
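For illustration, a minimal standalone sketch using plain re.search (the allow patterns are ordinary regexes applied to each extracted link URL). The sample URL follows the pagination form described in the comments of start_urls, and the stray "]" from the original pattern is removed to isolate the "?" issue:

    import re

    # Sample pagination URL of the form described in the start_urls comments
    url = 'https://www.kiweb.de/default.aspx?pageid=206&page=2'

    # Pattern as in the question: the unescaped "?" makes the preceding "x"
    # optional instead of matching the literal "?" before the query string.
    broken = r'Default\.aspx?pageid=(202|206)&page=\d+'

    # Escaped variant: "\?" matches the literal question mark. Note the
    # lowercase "default" as well, since re.search() is case-sensitive.
    fixed = r'default\.aspx\?pageid=(202|206)&page=\d+'

    print(bool(re.search(broken, url)))  # False
    print(bool(re.search(fixed, url)))   # True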
