Scrapy items are not being processed properly

Posted 2024-09-29 04:21:48


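My spider has two for loops, one for images and one for room data. Each works correctly when run on its own, but when both are in the spider, whichever loop comes first determines the result: I get either the image URLs or the room data, never both. I've tried rearranging the yield statements and have read the documentation on running multiple spiders, but I just want to know what I'm doing wrong.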

Here is my code. I'm very new to Scrapy and have only just learned about Item Loaders for formatting data, so I haven't used them yet.

items.py

import scrapy

class ResortItem(scrapy.Item):

    # images
    images = scrapy.Field()
    image_urls = scrapy.Field()

    # room details and amenities
    room_title = scrapy.Field()
    square_feet = scrapy.Field()
    kitchen = scrapy.Field()
    num_baths = scrapy.Field()
    max_guests = scrapy.Field()
    beds = scrapy.Field()
    washer_dryer = scrapy.Field()
    room_amenities = scrapy.Field()

Spider

import scrapy
from items import ResortItem

class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    allowed_domains = ['domains']
    start_urls = [
        'urls'
    ]
    def parse(self, response):
        item = ResortItem()
            unit_img_path = units_img.xpath(unit_image_selector).getall()

            url_list = imgs_path + unit_img_path
            image_urls = [
                "url" + x for x in url_list]
            item['image_urls'] = image_urls
            yield item
            # gets and sets the room_title to an item
            room_title = units.xpath(room_nameSelector).get().strip()
            item['room_title'] = room_title
            beds = units.xpath(bedSelector).getall()
            item['beds'] = beds
            num_baths = units.xpath(bathsSelector).get().strip()
            item['num_baths'] = num_baths
            # gets the square feet and sets it to an item
            square_feet = units.xpath(sqftSelector).get().strip()
            item['square_feet'] = square_feet
            room_amenities = units.xpath(room_amenitiesSelector).getall()

            # Pulls Washer/Dryer amenity if available
            washer_amenity = 'Washer'
            washer_dryer = list(
                filter(lambda x: washer_amenity in x, room_amenities))

            # Extracts the washer/dryer room_amenities list
            # setting room_amenities item
            room_amenities = [
                x for x in room_amenities if not x.startswith('Washer')]
            item['room_amenities'] = room_amenities

            # formatting Kitchen data
            # setting kitchens item
            kitchen = units.xpath(kitchenSelector).get().strip()
            item['kitchen'] = kitchen

            yield item

1 Answer
User
#1 · Posted 2024-09-29 04:21:48

Move this

        unit_image_selector = './/div[@class = "orbit-wrapper"]/ul//li/figure/img//@src'
        unit_img_path = units_img.xpath(unit_image_selector).getall()

        url_list = imgs_path + unit_img_path
        image_urls = [
            "https://clubwyndham.wyndhamdestinations.com" + x for x in url_list]
        item['image_urls'] = image_urls

inside the second loop

unit = './/div[contains(@id, "unit-details")]'
for units in response.xpath(unit):

Then delete the first loop and any variables that are no longer needed.
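
For illustration, here is a rough sketch of the shape the single loop ends up with. It is plain Python, not Scrapy: each dict in `units` stands in for one node matched by the unit-details XPath, and the function name `parse_units`, the dict keys and the sample data are all made up for this example. Storing `washer_dryer` on the item is also an assumption, since the posted code computes that list but never assigns it. The idea is just that each unit gets a fresh item, every field (images and room data) is set on it, and it is yielded once at the end.

```python
# Plain-Python stand-in for the spider's parse(); no Scrapy required.
# Each "unit" dict plays the role of one matched unit-details node.
# The function name, dict keys and sample data are hypothetical.
BASE_URL = "https://clubwyndham.wyndhamdestinations.com"


def parse_units(units, shared_img_paths):
    """Yield one complete item per unit: image URLs and room data together."""
    for unit in units:
        item = {}  # fresh item per unit, so nothing leaks between units

        # Image URLs are built inside the same loop as the room data.
        url_list = shared_img_paths + unit["img_paths"]
        item["image_urls"] = [BASE_URL + path for path in url_list]

        item["room_title"] = unit["title"].strip()

        amenities = unit["amenities"]
        # Assumption: keep the washer/dryer entries in their own field.
        item["washer_dryer"] = [a for a in amenities if "Washer" in a]
        item["room_amenities"] = [
            a for a in amenities if not a.startswith("Washer")]

        yield item  # single yield, after every field has been set


if __name__ == "__main__":
    sample_units = [
        {"img_paths": ["/a.jpg"], "title": " Studio ",
         "amenities": ["Washer/Dryer", "WiFi"]},
    ]
    for item in parse_units(sample_units, ["/shared.jpg"]):
        print(item)
```

In the real spider the same structure would presumably apply with `ResortItem()` in place of the dict and `units.xpath(...)` calls in place of the dict lookups.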
