Scrapy: logging in, then crawling, does not work as expected

Posted 2024-09-30 14:26:53

I am trying to learn how to log in with my spider, and wrote the code below for that purpose. The expected result is:

{
   "username": "willingc",
   "email": "carolcode@willingconsulting.com",
   "url": "https://www.willingconsulting.com",
}

However, the actual result is:

{
   "username": "willingc",
   "email": None,
   "url": "https://www.willingconsulting.com",
}

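The `None` for the email usually appears when the browser is not logged in. Do you see a mistake in my code? The only sign of a problem I can find is the following warning: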

WARNING:py.warnings:/workspace/.pip-modules/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py:287: RuntimeWarning: Could not load referrer policy 'origin-when-cross-origin, strict-origin-when-cross-origin'
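
A note on this warning, offered as a sketch rather than a diagnosis: the quoted value holds two policy names separated by a comma, which is what a `Referrer-Policy` response header looks like when a server lists a fallback. The Referrer Policy specification lets a client use the last policy it recognizes, while a loader that treats the whole string as a single name finds no match. If that reading is right, the warning concerns how the `Referer` header is built and is probably unrelated to the login. The names `KNOWN_POLICIES` and `pick_policy` below are hypothetical and not part of Scrapy.

```python
from typing import Optional

# Policy names defined by the Referrer Policy specification.
KNOWN_POLICIES = {
    "no-referrer",
    "no-referrer-when-downgrade",
    "same-origin",
    "origin",
    "strict-origin",
    "origin-when-cross-origin",
    "strict-origin-when-cross-origin",
    "unsafe-url",
}


def pick_policy(header_value: str) -> Optional[str]:
    """Hypothetical helper: return the last recognized policy token, or None."""
    chosen = None
    for token in header_value.split(","):
        name = token.strip().lower()
        if name in KNOWN_POLICIES:
            chosen = name  # later recognized tokens override earlier ones
    return chosen


raw = "origin-when-cross-origin, strict-origin-when-cross-origin"
print(raw in KNOWN_POLICIES)  # False: the combined string is not a policy name
print(pick_policy(raw))       # strict-origin-when-cross-origin
```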

import scrapy
from scrapy.http import FormRequest


class GitHubSpider(scrapy.Spider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ["https://github.com/login"]

    def parse(self, response):
        # from_response() copies the form's hidden inputs (including
        # authenticity_token) on its own. Overriding the token with an XPath
        # result is risky: if the XPath matches nothing, the None value makes
        # Scrapy leave the field out of the request.
        return FormRequest.from_response(
            response,
            formdata={
                "login": "mygithub@gmail.com",
                "password": "12345",
            },
            callback=self.parse_after_login,
        )

    def parse_after_login(self, response):
        # Cookies set by the login response are reused for this request.
        yield scrapy.Request(
            url="https://github.com/willingc",
            callback=self.parse_engineer,
        )

    def parse_engineer(self, response):
        yield {
            # default="" avoids an AttributeError when nothing matches.
            "username": response.css(".vcard-username::text").get(default="").strip(),
            "email": response.xpath('//li[@itemprop="email"]/a//text()').get(),
            "url": response.xpath('//li[@itemprop="url"]/a//@href').get(),
        }
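
One way to narrow this down is to check in `parse_after_login` whether the login actually went through before requesting the profile. The sketch below is plain Python so it can be tried without a network connection. The function name `login_failed` is hypothetical, and the two failure markers (landing on a `/session` or `/login` URL, or seeing the text "Incorrect username or password") are assumptions about GitHub's behavior that should be verified against a real response.

```python
from urllib.parse import urlparse

# Assumed markers of a failed login; verify them against the real page.
FAILURE_TEXT = "Incorrect username or password"
FAILURE_PATHS = {"/session", "/login"}


def login_failed(url: str, body: str) -> bool:
    """Hypothetical check: True if the post-login response looks like a failure."""
    path = urlparse(url).path.rstrip("/")
    return path in FAILURE_PATHS or FAILURE_TEXT in body


print(login_failed("https://github.com/session",
                   "<p>Incorrect username or password.</p>"))  # True
print(login_failed("https://github.com/", "<title>GitHub</title>"))  # False
```

Inside the spider this could be called as `login_failed(response.url, response.text)` at the top of `parse_after_login`, logging through `self.logger.error(...)` and returning early when it is true, so a failed login is reported instead of silently producing `None` for the email.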
