Need help scraping a hotel list from a website with a fixed URL and dynamically loaded content?

Posted 2024-07-02 12:15:49


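I am trying to scrape details from a hotel listing website (this site). When we click the Next button to go to the next page, the URL stays the same, and inspecting the page shows that the site is sending XHR requests. I tried using Selenium WebDriver with Python; here is my code: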

from time import sleep
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException

class DineoutRestaurantSpider(scrapy.Spider):
    name = 'dineout_restaurant'
    allowed_domains = ['dineout.co.in/bangalore-restaurants?search_str=']
    start_urls = ['http://dineout.co.in/bangalore-restaurants?search_str=']
    def start_requests(self):
        self.driver = webdriver.Chrome('/Users/macbookpro/Downloads/chromedriver')
        self.driver.get('https://www.dineout.co.in/bangalore-restaurants?search_str=')

        url = 'https://www.dineout.co.in/bangalore-restaurants?search_str='
        yield Request(url, callback=self.parse)  # the yield in question
        self.logger.info('Empty message')

        for i in range(1, 4):
            try:
                next_page = self.driver.find_element_by_xpath('//a[text()="Next "]')
                sleep(11)
                self.logger.info('Sleeping for 11 seconds.')
                next_page.click()
                url = 'https://www.dineout.co.in/bangalore-restaurants?search_str='
                yield Request(url, callback=self.parse)

            except NoSuchElementException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break

    def parse(self, response):
        self.logger.info('Entered parse method')
        restaurants = response.xpath('//*[@class="cardBg"]')
        for restaurant in restaurants:
            name = restaurant.xpath('.//*[@class="titleDiv"]/h4/a/text()').extract_first()
            location = restaurant.xpath('.//*[@class="location"]/a/text()').extract()
            rating = restaurant.xpath('.//*[@class="rating rating-5"]/a/span/text()').extract_first()
            yield {
                'Name': name,
                'Location': location,
                'Rating': rating,
            }

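In the code above, the yielded requests do not seem to reach the parse function. Am I missing something? I don't get any errors, but the scraped output contains only page 1, even though the pages are being iterated.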
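
One possible explanation, offered as a guess rather than a confirmed diagnosis: every request in the loop uses the same URL, and Scrapy's scheduler by default drops a request it has already seen unless the request is created with `dont_filter=True`. The toy model below illustrates that behaviour with the standard library only. `SeenUrlFilter` and `should_schedule` are hypothetical names invented for this sketch; they are not Scrapy's classes and do not reproduce its fingerprinting logic.

```python
class SeenUrlFilter:
    """Toy stand-in for a crawler's duplicate filter (hypothetical, not Scrapy's code)."""

    def __init__(self):
        self._seen = set()

    def should_schedule(self, url, dont_filter=False):
        # Requests flagged dont_filter bypass the filter entirely.
        if dont_filter:
            return True
        # A URL that was already scheduled is silently dropped.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


URL = 'https://www.dineout.co.in/bangalore-restaurants?search_str='

dupe_filter = SeenUrlFilter()

# One initial request plus three "next page" requests, all with the same URL.
scheduled = [dupe_filter.should_schedule(URL) for _ in range(4)]

# The same three follow-up requests, this time flagged dont_filter=True.
forced = [dupe_filter.should_schedule(URL, dont_filter=True) for _ in range(3)]

print(scheduled)
print(forced)
```

In this model only the first request is scheduled, which would match an output that contains page 1 only. Note that `dont_filter=True` alone would probably not be enough here: Scrapy downloads the URL itself, so each response would still be the first page. The content loaded after each click exists only in the Selenium browser, available as `self.driver.page_source`, which could be parsed with `Selector(text=self.driver.page_source)` instead of issuing a new `Request`.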

