Scrapy redirects to logout before scraping data

Posted 2024-09-29 06:24:17


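The title may be a bit confusing, so let me explain. I am trying to build a simple scraper with Scrapy that scrapes my bank's website for some automated budgeting. So far I seem to be able to log in, but I am then logged out without getting the data I want. Here is some of the output from my terminal: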

1. 2018-03-27 00:56:56 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.bank.org/signin-page.html> (referer: https://www.bank.org/signin-page.html)
2. 2018-03-27 00:56:56 [LOG] INFO: LOGIN ATTEMPT SUCCESSFUL
3. 2018-03-27 00:56:56 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.bankonline.org/robots.txt> (referer: None)
4. 2018-03-27 00:56:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.bankonline.org/tob/live/usp-core/app/logout?reason=logout> from <GET https://www.bankonline.org/tob/live/usp-core/app/home>
5. 2018-03-27 00:56:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bankonline.org/tob/live/usp-core/app/logout?reason=logout> (referer: https://www.bank.org/signin-page.html)
6. 2018-03-27 00:56:56 [LOG] INFO: VISITED https://www.bankonline.org/tob/live/usp-core/app/logout?reason=logout
7. 2018-03-27 00:56:57 [scrapy.core.engine] INFO: Closing spider (finished)

Line 4 is where it starts redirecting me. Here is my code:

import scrapy
import logging

logger = logging.getLogger('LOG')
USERNAME = 'user'
PASSWORD = 'pass'

class Budget_Bank(scrapy.Spider):
    name = "Budget_Bank"
    login_url = 'https://www.bank.org/signin-page.html'
    start_urls = ['https://www.bank.org/signin-page.html']

    def parse(self, response):
        yield scrapy.FormRequest(url=self.login_url,
                                 formdata={'username': USERNAME,
                                           'password': PASSWORD},
                                 callback=self.login_test)


    def login_test(self, response):
        if 'errors' in response.text:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        else:
            logger.info("LOGIN ATTEMPT SUCCESSFUL")
            yield scrapy.Request('https://www.bankonline.org'
                                 '/tob/live/usp-core/app/home',
                                 callback=self.parse_number)


    def parse_number(self, response):
        logger.info("VISITED %s", response.url)
        for number in response.css('div._1qtcLoK1d4PZmeghcgyE2K'):
            yield {
                # The rest of the class name is elided ("...") in the original post.
                'num': number.css('span.formattedMoney_balanceBZozG-'
                                  '...::text').extract_first(),
            }

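So far I am only trying to pull a single number from the site to test whether I can actually retrieve data. My login_test callback reports that I logged in correctly, but instead of proceeding to the home page, the spider is redirected to logout. For obvious reasons I have left out some information, such as my username and password, and I have also changed the website's name. Any help would be appreciated.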


1 answer
User
#1 · posted 2024-09-29 06:24:17

You are being redirected to logout because the site detects that you are a bot:

3. 2018-03-27 00:56:56 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.bankonline.org/robots.txt> (referer: None)

You can try setting ROBOTSTXT_OBEY to False.
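As a minimal sketch of that suggestion, assuming the spider lives in a standard Scrapy project with a settings.py file (the project layout is not shown in the question), the setting could look like this. Whether it actually stops the redirect to logout on this site is not confirmed here.

```python
# settings.py of the Scrapy project (file location assumed, not shown in the question)

# With this set to False, Scrapy does not request robots.txt before crawling
# and does not filter requests against its rules.
ROBOTSTXT_OBEY = False
```

The same setting can also be applied per spider through the custom_settings class attribute, for example custom_settings = {'ROBOTSTXT_OBEY': False} inside Budget_Bank, or for a single run with scrapy crawl Budget_Bank -s ROBOTSTXT_OBEY=False.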

For more information, see the Scrapy documentation.
