Scrapy crawl returns None


I'm Brazilian, so I apologize for my bad English. I've started learning Python and Scrapy, and I'm trying to get information out of a table, but for some reason the function I wrote returns "None", as you can see:

DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'teste': None}

Any class I try to pass to response.css returns "None". I also tried grabbing a piece of text from another website with the same code and it worked, so I guess it's something about this particular site, but I really don't know. Can anyone help me with this?

Here is the code I wrote:

import scrapy


class QuotesSpider(scrapy.Spider):

    name = "equipes"
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def parse(self, response):
        yield {'teste': response.css('tbody tr td.tablesaw-cell-persist').get()}

1 Answer

You're on the right track. The data is generated dynamically with JavaScript. If you disable JavaScript in your browser, open the dropdown and try to switch the table, say from "CBLOL Split 1 2020" to "CBLOL Academy Split 2 2021", you'll see that it never changes. That's what dynamically populated JavaScript data looks like: you can't get it just by fetching the static HTML, which is why you need a headless browser here. In practice we can't settle on one technique for scraping every site; the site itself dictates which technique we have to use. Here I use Selenium together with Scrapy, and it's still very fast, much like a plain Scrapy spider.
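If you want to confirm this from the Scrapy side, the interactive shell shows the same thing: against the raw HTML the server returns, before any JavaScript has run, the selector from the question finds nothing. This is just a quick check, not part of the spider:

scrapy shell "https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/"
>>> response.css('tbody tr td.tablesaw-cell-persist').get()
# returns None - the table cells are simply not in the HTML that was downloaded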

My code:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from time import sleep


class TeamsSpider(scrapy.Spider):
    name = 'teams'
    allowed_domains = ['gol.gg']
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def __init__(self):
        chrome_options = Options()
        # chrome_options.add_argument("--headless")  # uncomment to run Chrome without a window

        chrome_path = which("chromedriver")

        self.driver = webdriver.Chrome(executable_path=chrome_path)  # , options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/")
        sleep(5)  # give the page time to run its JavaScript
        # select the tournament in the dropdown so the table gets populated
        dropDown = self.driver.find_element_by_xpath('//*[@id="cbtournament"]/option[text()= "CBLOL Split 1 2020"]')
        dropDown.click()
        sleep(5)  # wait for the table to refresh

        # keep the rendered HTML, then shut the browser down
        self.html = self.driver.page_source
        self.driver.close()

    def parse(self, response):
        # parse the Selenium-rendered HTML instead of the static response
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Name': tr.xpath(".//td/a/text()").get(),
                'Season': tr.xpath(".//td[2]/text()").get(),
                'Region': tr.xpath(".//td[3]/text()").get(),
                'Games': tr.xpath(".//td[4]/text()").get(),
                'winRate': tr.xpath(".//td[5]/text()").get()
            }
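One caveat: the spider above uses the old Selenium 3 API (find_element_by_xpath and the executable_path keyword), which newer Selenium 4.x releases have removed. As a rough sketch of the same fetch step with the current API (the fetch_rendered_html helper name and the explicit waits are my own additions; the URL and XPaths are taken from the spider above):

from shutil import which

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def fetch_rendered_html(url, tournament):
    """Open the page in Chrome, pick a tournament from the dropdown and
    return the JavaScript-rendered HTML."""
    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(service=Service(which("chromedriver")), options=options)
    try:
        driver.set_window_size(1920, 1080)
        driver.get(url)
        # wait for the dropdown option instead of sleeping a fixed 5 seconds
        option = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(
                (By.XPATH, f'//*[@id="cbtournament"]/option[text()="{tournament}"]')
            )
        )
        option.click()
        # wait until the table body has rows before grabbing the page source
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '(//tbody)[2]/tr'))
        )
        return driver.page_source
    finally:
        driver.quit()

Either way, running the spider with "scrapy crawl teams -o teams.json" writes the scraped items to a file.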

Output:

2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'FURIA Uppercut', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'INTZ e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '38.1%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'KaBuM! e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'paiN Gaming', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '47.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Prodigy Esports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Redemption POA', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '28.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Vivo Keyd', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '66.7%'}
2021-08-04 13:59:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-04 13:59:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1078,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.668237,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 4, 7, 59, 37, 58803),
 'httpcompression/response_bytes': 278,
 'httpcompression/response_count': 2,
 'item_scraped_count': 8,
