I just set up a simple Scrapy spider to scrape some data that sits behind a user login.
For a few hours now I have been trying to log in to the following site with scrapy.FormRequest.from_response(),
using the local login (the second one): https://www.campus.uni-erlangen.de/
As you can see from the console output, the form data is posted to the server (see the attached log, the "Redirecting (302)" line), but after this redirect the login is not successful. I suspect it has something to do with cookies, but I have not been able to solve it.
You will also find the network traffic captured during a regular browser login below.
As a sanity check, I ran the same code against http://quotes.toscrape.com/login, and that worked.
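To check whether the server's Set-Cookie headers are actually being picked up and sent back, Scrapy can log all cookie traffic. A sketch of the relevant setting in the project's settings.py (a debugging aid, not a fix):

```python
# settings.py -- log every Cookie / Set-Cookie header Scrapy sends and receives
COOKIES_DEBUG = True
```

With this enabled, the crawl log shows whether a session cookie was received on the login response and whether it was attached to the follow-up redirect request.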
Spider code:
import scrapy
from scrapy.shell import inspect_response
from urllib.parse import urlparse

class QuotesSpider(scrapy.Spider):
    name = "meincampus3"
    start_urls = ['https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0']

    def parse(self, response):
        full_url = response.xpath('//*[@id="loginform"]/@action').extract_first()
        query = urlparse(full_url).query
        login_url = "https://www.campus.uni-erlangen.de/qisserver/rds?" + query
        print("fullurl: " + full_url)
        print("query: " + query)
        print("loginurl: " + login_url)
        return scrapy.FormRequest.from_response(
            response,
            url=login_url,
            formid='loginform',
            clickdata={'type': 'submit'},
            formdata={
                'username': 'UN',
                'password': 'PASSWD',
                'submit': 'Anmelden',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        inspect_response(response, self)
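One thing worth noting about the URL handling above: rebuilding login_url from only the query string drops the ;jsessionid=... path parameter that the form's action URL carries, and the server appears to track the session through that parameter. A minimal stand-in (the session id here is shortened for illustration) showing the difference between the rebuilt URL and a plain URL join:

```python
from urllib.parse import urlparse, urljoin

base = "https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0"
# action URL as served by the page (session id shortened for illustration)
full_url = "/qisserver/rds;jsessionid=ABC123?state=user&type=1"

# The spider's approach: keep only the query string
rebuilt = "https://www.campus.uni-erlangen.de/qisserver/rds?" + urlparse(full_url).query

# Joining the action URL against the page URL keeps the ;jsessionid part
joined = urljoin(base, full_url)

print(rebuilt)  # the ;jsessionid=... path parameter is gone
print(joined)   # the session id survives
```

So posting to the rebuilt URL means the server never sees the session id it embedded into the form action.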
Console output:
(env) johannesschilling@Johannes-MBP scrapytutorial % scrapy crawl meincampus3
2020-08-04 11:29:28 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: scrapytutorial)
2020-08-04 11:29:28 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, Jun 4 2020, 19:29:32) - [Clang 11.0.3 (clang-1103.0.32.62)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Darwin-19.5.0-x86_64-i386-64bit
2020-08-04 11:29:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-04 11:29:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapytutorial',
'NEWSPIDER_MODULE': 'scrapytutorial.spiders',
'SPIDER_MODULES': ['scrapytutorial.spiders']}
2020-08-04 11:29:28 [scrapy.extensions.telnet] INFO: Telnet Password: 239396afe0083e22
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-04 11:29:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-04 11:29:28 [scrapy.core.engine] INFO: Spider opened
2020-08-04 11:29:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-04 11:29:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-04 11:29:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0> (referer: None)
fullurl: /qisserver/rds;jsessionid=6EA68E318E6C51D64DEFD0EA6C33417A.cit-prod-tomcat1;xuser=6DB5F329336AE359E3568D30A10EBC75.cit-prod-tomcat1?state=user&type=1
query: state=user&type=1
loginurl: https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=1
2020-08-04 11:29:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm> from <POST https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=1>
2020-08-04 11:29:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm> (referer: https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10a1cb5d0>
[s] item {}
[s] request <GET https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm>
[s] response <200 https://www.campus.uni-erlangen.de/qisserver/rds?state=user&type=0&category=menu.browse&breadCrumbSource=&startpage=portal.vm>
[s] settings <scrapy.settings.Settings object at 0x10b1ed710>
[s] spider <QuotesSpider 'meincampus3' at 0x10b04cc50>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
(Screenshots: network traffic of the browser's login POST request -- general info, request headers, form data.)
Login attempts:
(Screenshots: view(response) with a wrong password, and with the correct password.)
I am not sure whether this works, since I do not have a valid username or password, but the URL that from_response posts to seems to differ from the one we see in the network traffic. I would also specify the form (to make sure FormRequest is looking at the right form) and the clickdata (to make sure we simulate clicking the right button):