Scraping a web page with Python requests


I want to scrape this page with BeautifulSoup after making the request, but I can't find the table.

Code:

headers = {"Referer": "https://www.atptour.com/en/scores/results-archive",
            'User-Agent': 'my-user-agent'
        }
url = 'https://www.atptour.com/en/scores/results-archive?year=2016'
page = requests.get(url, headers=headers)
print(page)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', class_="results-archive-table mega-table")
print(table)

Output: <Response [403]> {}


3 Answers

I got Response [200] using scrapy-selenium together with selenium-stealth.

Code:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium_stealth import stealth
from selenium import webdriver
from shutil import which
from selenium.webdriver.chrome.options import Options

class AtpSpider(scrapy.Spider):
    name = 'atptour'

    # Locate chromedriver on PATH and run Chrome headless.
    chrome_path = which("chromedriver")
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)

    # Apply selenium-stealth so the headless browser looks like a regular one.
    stealth(driver,
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=False)

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.atptour.com/en/scores/results-archive?year=2016',
            wait_time=5,
            callback=self.parse,
        )

    def parse(self, response):
        pass

Output:

2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-07-31 10:25:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atptour.com/en/scores/results-archive> (referer: None)
2021-07-31 10:25:05 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:53662/session/039ca0bb0a64b7b9eb48ab26a0f464a0 {}
2021-07-31 10:25:05 [urllib3.connectionpool] DEBUG: http://127.0.0.1:53662 "DELETE /session/039ca0bb0a64b7b9eb48ab26a0f464a0 HTTP/1.1" 200 14
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-07-31 10:25:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 15142,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
      
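For the SeleniumRequest above to actually be routed through Selenium, scrapy-selenium's downloader middleware also has to be enabled in the project's settings.py. A minimal sketch of those entries, assuming a standard Scrapy project and chromedriver on PATH (the driver name, path, and arguments shown here mirror the spider above and are not part of the original answer):

# settings.py -- minimal sketch of the scrapy-selenium configuration
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}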

The site is protected by Cloudflare and expects JavaScript to be enabled by the visitor, just like a real browser, which the requests library cannot do. So you can try using Selenium instead.

Another thing I noticed is that running Selenium in headless mode triggers a captcha, while non-headless works. Finally, you can use BeautifulSoup for the parsing.

Try this:

from selenium import webdriver
from bs4 import BeautifulSoup

chrome_path = 'Add your chromedriver path here'
driver = webdriver.Chrome(executable_path=chrome_path)

url = 'https://www.atptour.com/en/scores/results-archive?year=2016'
driver.get(url)
# Grab the fully rendered HTML after JavaScript has run.
data = driver.page_source

soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', class_="results-archive-table mega-table")
print(table)

driver.quit()
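Once the table is found, its rows can be pulled out with ordinary BeautifulSoup calls. A minimal sketch, assuming the table uses plain <tr>/<td> markup (the exact column layout of the ATP page is not shown in the original answer):

# Sketch only: iterate the rows of the table found above and print each cell's text.
if table is not None:
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
        if cells:
            print(cells)
else:
    print('Table not found in the rendered page')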

Look at the response:

print(page)
<Response [403]>

Maybe you have to add some headers to your request.
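For example, a sketch of a fuller browser-like header set for requests (the header values below are illustrative assumptions; since the 403 here comes from Cloudflare's JavaScript check, extra headers alone may not be enough):

import requests

# Illustrative browser-like headers; values are assumptions, not a guaranteed fix.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.atptour.com/en/scores/results-archive',
}
page = requests.get('https://www.atptour.com/en/scores/results-archive?year=2016', headers=headers)
print(page.status_code)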
