How to scrape JS-dependent content with Scrapy

1 answer

User

#1 · Posted 2024-10-03 17:22:48

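In the scrapy shell, the itemprop XPath you are searching for is not available because, as @furas said, some of the content is generated by JavaScript. You can get this content by adding Selenium. Selenium takes a URL and renders it in a web browser, and Scrapy can then work with the resulting HTML as usual. The code below is a skeleton to get started with Firefox, but it also works with other browsers. I also recommend using Firebug with Firefox, which is useful for practicing XPath.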

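import scrapy
from scrapy import signals
from scrapy.http import TextResponse

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class SearchSpider(scrapy.Spider):
    name = "search"

    # Placeholder domain; it should match the domain used in start_urls
    allowed_domains = ['www.somewebsite.com']
    start_urls = ['https://www.somewebsite.com']

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # wire us up to selenium
        self.driver = webdriver.Firefox()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # scrapy.xlib.pydispatch is gone from current Scrapy releases,
        # so connect the spider_closed signal through the crawler instead
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # quit() ends the whole browser session, not just the current window
        self.driver.quit()

    def parse(self, response):
        # The original used an undefined someItem(); a plain dict stands in for it here
        item = {}

        # Load the current page into Selenium
        self.driver.get(response.url)

        try:
            # Wait up to 30 seconds for the JavaScript-generated element to appear
            WebDriverWait(self.driver, 30).until(EC.presence_of_element_located((By.XPATH, '//span[@itemprop="name"]')))
        except TimeoutException:
            item['status'] = 'timed out'

        # Sync scrapy and selenium so they agree on the page we're looking at then let scrapy take over
        resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        # scrape as normal, querying resp instead of response; the field below is only an illustration
        item['name'] = resp.xpath('//span[@itemprop="name"]/text()').get()
        yield item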
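
Before adding Selenium, it can help to confirm that the element really is missing from the HTML that Scrapy downloads and only appears after the browser has run the page's JavaScript. The sketch below uses only the Python standard library to count elements carrying a given itemprop value in an HTML string. The names ItempropFinder and count_itemprop, and both sample HTML strings, are made up for illustration; in practice you would pass in response.text from Scrapy and driver.page_source from Selenium.

```python
from html.parser import HTMLParser


class ItempropFinder(HTMLParser):
    """Counts start tags whose itemprop attribute equals the wanted value."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if dict(attrs).get("itemprop") == self.itemprop:
            self.count += 1


def count_itemprop(html, itemprop="name"):
    """Return how many elements in html carry itemprop=<itemprop>."""
    finder = ItempropFinder(itemprop)
    finder.feed(html)
    finder.close()
    return finder.count


# Hypothetical page as downloaded: the content is filled in later by a script
raw_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical page after a browser has executed the script
rendered_html = '<html><body><div id="app"><span itemprop="name">Widget</span></div></body></html>'

print(count_itemprop(raw_html))       # 0
print(count_itemprop(rendered_html))  # 1
```

If the count is zero for the downloaded HTML but positive for the rendered page source, the element is generated by JavaScript and a rendering step such as the Selenium one above is likely needed.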
