Scrapy crawl returns None


I'm Brazilian, so I apologize for my bad English. I've started learning Python and Scrapy, and I'm trying to get information out of a table, but for some reason the function I wrote returns "None", as you can see:

DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'teste': None}

Any class I try to pass to response.css returns "None". I also tried grabbing a piece of text from another website with the same code and it worked, so I guess it's something about this particular site, but I really don't know. Can anyone help me with this?

Here is the code I wrote:

import scrapy


class QuotesSpider(scrapy.Spider):

    name = "equipes"
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def parse(self, response):
        yield {'teste': response.css('tbody tr td.tablesaw-cell-persist').get()}

1 Answer

You're on the right track. The data is generated dynamically with JavaScript. If you disable JavaScript in your browser, open the dropdown and try to switch the table, say from "CBLOL Split 1 2020" to "CBLOL Academy Split 2 2021", you'll see that it never changes. That's what dynamically populated JavaScript data looks like: you can't get it just by fetching the static HTML, which is why you need a headless browser here. In practice we can't settle on one technique for scraping every site; the site itself dictates which technique we have to use. Here I use Selenium together with Scrapy, and it's still very fast, much like a plain Scrapy spider.
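If you want to confirm this from the Scrapy side, the interactive shell shows the same thing: against the raw HTML the server returns, before any JavaScript has run, the selector from the question finds nothing. This is just a quick check, not part of the spider:

scrapy shell "https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/"
>>> response.css('tbody tr td.tablesaw-cell-persist').get()
# returns None - the table cells are simply not in the HTML that was downloaded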

My code:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from time import sleep


class TeamsSpider(scrapy.Spider):
    name = 'teams'
    allowed_domains = ['gol.gg']
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def __init__(self):
        chrome_options = Options()
        # chrome_options.add_argument("--headless")  # uncomment to run Chrome without a window

        chrome_path = which("chromedriver")

        self.driver = webdriver.Chrome(executable_path=chrome_path)  # , options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/")
        sleep(5)  # give the page time to run its JavaScript
        # select the tournament in the dropdown so the table gets populated
        dropDown = self.driver.find_element_by_xpath('//*[@id="cbtournament"]/option[text()= "CBLOL Split 1 2020"]')
        dropDown.click()
        sleep(5)  # wait for the table to refresh

        # keep the rendered HTML, then shut the browser down
        self.html = self.driver.page_source
        self.driver.close()

    def parse(self, response):
        # parse the Selenium-rendered HTML instead of the static response
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Name': tr.xpath(".//td/a/text()").get(),
                'Season': tr.xpath(".//td[2]/text()").get(),
                'Region': tr.xpath(".//td[3]/text()").get(),
                'Games': tr.xpath(".//td[4]/text()").get(),
                'winRate': tr.xpath(".//td[5]/text()").get()
            }
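One caveat: the spider above uses the old Selenium 3 API (find_element_by_xpath and the executable_path keyword), which newer Selenium 4.x releases have removed. As a rough sketch of the same fetch step with the current API (the fetch_rendered_html helper name and the explicit waits are my own additions; the URL and XPaths are taken from the spider above):

from shutil import which

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def fetch_rendered_html(url, tournament):
    """Open the page in Chrome, pick a tournament from the dropdown and
    return the JavaScript-rendered HTML."""
    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(service=Service(which("chromedriver")), options=options)
    try:
        driver.set_window_size(1920, 1080)
        driver.get(url)
        # wait for the dropdown option instead of sleeping a fixed 5 seconds
        option = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(
                (By.XPATH, f'//*[@id="cbtournament"]/option[text()="{tournament}"]')
            )
        )
        option.click()
        # wait until the table body has rows before grabbing the page source
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '(//tbody)[2]/tr'))
        )
        return driver.page_source
    finally:
        driver.quit()

Either way, running the spider with "scrapy crawl teams -o teams.json" writes the scraped items to a file.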

Output:

2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'FURIA Uppercut', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'INTZ e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '38.1%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'KaBuM! e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'paiN Gaming', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '47.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Prodigy Esports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Redemption POA', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '28.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Vivo Keyd', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '66.7%'}
2021-08-04 13:59:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-04 13:59:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1078,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.668237,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 4, 7, 59, 37, 58803),
 'httpcompression/response_bytes': 278,
 'httpcompression/response_count': 2,
 'item_scraped_count': 8,
