Scrapy user login with FormRequest.from_response() not working


I just set up a simple Scrapy spider to scrape some data that is protected by a user login.
For several hours now I have been trying to log in to the following site with Scrapy's FormRequest.from_response(), using the local login (the second one): https://www.campus.uni-erlangen.de/

As I can see from the console output, the form data is posted to the server (see the attached log: REDIRECTING 302), but the login is not successful after this redirect. I suspect this has something to do with cookies, but I have not been able to solve the problem.
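
To see what the cookie middleware is doing across that redirect, Scrapy's built-in COOKIES_DEBUG setting logs every cookie received and sent. A minimal sketch of enabling it, in settings.py or the spider's custom_settings:

# Log every Set-Cookie header received and every Cookie header sent,
# to verify that the session cookie actually survives the 302 redirect.
COOKIES_DEBUG = True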

You will also find the network traffic captured during a regular browser login attached below.

For testing purposes, I tried to log in to http://quotes.toscrape.com/login with the same code. That worked fine.

Scrapy code:

import scrapy
from scrapy.shell import inspect_response
from urllib.parse import urlparse


class QuotesSpider(scrapy.Spider):
    name = "meincampus3"
    start_urls = ['https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0']

    def parse(self, response):
        # The login form's action is a relative URL; keep only its query
        # string and append it to the qisserver base URL.
        full_url = response.xpath('//*[@id="loginform"]/@action').extract_first()
        query = urlparse(full_url).query
        login_url = "https://www.campus.uni-erlangen.de/qisserver/rds?" + query
        print("fullurl: " + full_url)
        print("query: " + query)
        print("loginurl: " + login_url)

        # Submit the login form, selected by its id; clickdata picks the
        # submit button, and the credentials here are placeholders.
        return scrapy.FormRequest.from_response(
            response,
            url=login_url,
            formid='loginform',
            clickdata={'type': 'submit'},
            formdata={
                'username': 'UN',
                'password': 'PASSWD',
                'submit': 'Anmelden',
            },
            callback=self.after_login,
        )


    def after_login(self, response):
        inspect_response(response, self)
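
A check inside after_login makes the failure easier to spot than eyeballing the shell. A minimal sketch; the marker string "Abmelden" is an assumption about the portal, so substitute any text that only appears while logged in:

    def after_login(self, response):
        # Assumption: a logged-in qisserver page shows a logout link
        # ("Abmelden"); use any marker unique to the logged-in state.
        if "Abmelden" in response.text:
            self.logger.info("Login appears successful")
        else:
            self.logger.error("Login appears to have failed")
        inspect_response(response, self)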

Console output:

(env) johannesschilling@Johannes-MBP scrapytutorial % scrapy crawl meincampus3
2020-08-04 11:29:28 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: scrapytutorial)
2020-08-04 11:29:28 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, Jun  4 2020, 19:29:32) - [Clang 11.0.3 (clang-1103.0.32.62)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Darwin-19.5.0-x86_64-i386-64bit
2020-08-04 11:29:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-04 11:29:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapytutorial',
 'NEWSPIDER_MODULE': 'scrapytutorial.spiders',
 'SPIDER_MODULES': ['scrapytutorial.spiders']}
2020-08-04 11:29:28 [scrapy.extensions.telnet] INFO: Telnet Password: 239396afe0083e22
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-04 11:29:28 [scrapy.core.engine] INFO: Spider opened
2020-08-04 11:29:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-04 11:29:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-04 11:29:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0> (referer: None)
fullurl: /qisserver/rds;jsessionid=6EA68E318E6C51D64DEFD0EA6C33417A.cit-prod-tomcat1;xuser=6DB5F329336AE359E3568D30A10EBC75.cit-prod-tomcat1?state=user&type=1
query: state=user&type=1
loginurl: https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=1
2020-08-04 11:29:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm> from <POST https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=1>
2020-08-04 11:29:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm> (referer: https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10a1cb5d0>
[s]   item       {}
[s]   request    <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm>
[s]   response   <200 https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm>
[s]   settings   <scrapy.settings.Settings object at 0x10b1ed710>
[s]   spider     <QuotesSpider 'meincampus3' at 0x10b04cc50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 

[Screenshots: network traffic of the login POST request (general, request headers, form data)]

Login attempts:

[Screenshot: view(response) with the wrong password]
[Screenshot: view(response) with the correct password]
[Screenshot: POST info from a correct manual login]
[Screenshot: GET info from a correct manual login]


1 Answer:

I'm not sure whether this works, since I don't have a valid username or password, but the URL that from_response produces seems to differ from the one we see in the network traffic. I would also specify the form (to make sure FormRequest is looking at the right form) and the clickdata (to make sure we simulate clicking the right button):

from urllib.parse import urlparse

full_url = response.xpath('//*[@id="loginform"]/@action').extract_first()
query = urlparse(full_url).query
login_url = "https://www.campus.uni-erlangen.de/qisserver/rds?" + query
headers = {
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://www.campus.uni-erlangen.de',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0',
    'Accept-Language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4,fr;q=0.3,it;q=0.2',
}

return scrapy.FormRequest.from_response(
    response,
    url=login_url,
    formid='loginform',
    clickdata={'type': 'submit'},
    formdata={
        'username': 'USERNAME',
        'password': 'PASSWD',
        'submit': 'Anmelden',
    },
    callback=self.after_login,
    headers=headers,
)
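
One more thing that might matter (an assumption; I could not verify it without valid credentials): the form action in the log contains a ;jsessionid=... path parameter, and the rebuilt login_url drops it. Servlet containers such as Tomcat use that parameter to track the session before any cookie is set, so keeping the complete action URL may be necessary; a minimal sketch:

# Keep the full form action, including the ;jsessionid=... path parameter,
# instead of rebuilding the URL from the query string alone.
full_url = response.xpath('//*[@id="loginform"]/@action').get()
login_url = response.urljoin(full_url)  # resolves the relative action against the page URL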
