Nested JSON items with Scrapy

Posted 2024-06-25 06:04:15


Here is my basic scraper:

  def parse(self, response):        
    item = CruiseItem()     

    item['Cruise'] = {}
    item['Cruise']['Cruiseline'] = response.xpath('//title/text()').extract()
    item['Cruise']['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
    item['Cruise']['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
    item['Cruise']['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

    return item

This pulls in all the elements I want. For example, my JSON feed looks like this:

[sample JSON output missing from the source page]

However, the target JSON output is different:

[

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas "
        ],
        "Price": [
            "$169"
        ],
        "PerNight": [
            "$56/night"

        ]
    },
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "4 Night Bahamas "
        ],
        "Price": [
            "$79"
        ],
        "PerNight": [
            "$86/night"
        ]
    }
}
]

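Basically, I want to return one entry per cruise, each with its ship, itinerary, price, and per-night price.

Does this make sense? Happy to discuss.

Edit: I asked this a few days ago, but decided to clarify and repost. Thanks!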
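
One caveat about the target output above: it repeats the "Cruise" key inside a single object. Python's standard `json` module keeps only the last value for a repeated key, so a list with one object per cruise is probably what is wanted. A minimal check, using shortened sample strings made up for illustration:

```python
import json

# Hypothetical sample: one object that repeats the "Cruise" key,
# shaped like the target output above but shortened.
raw = '{"Cruise": {"Price": ["$169"]}, "Cruise": {"Price": ["$79"]}}'

parsed = json.loads(raw)

# The json module keeps the last value for a repeated key,
# so the first cruise is silently lost here.
print(parsed)

# A list with one object per cruise keeps both entries.
as_list = json.loads(
    '[{"Cruise": {"Price": ["$169"]}}, {"Cruise": {"Price": ["$79"]}}]'
)
print(len(as_list))
```

Both answers below produce the list-of-objects shape rather than repeated keys.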


Tags: text, name, id, json, response, extract, item, price

2 answers

Got it.

def parse(self, response):

    final_list = []

    item = WthItem()

    item['ship'] = response.xpath('//*[@id="shipName1"]/text()').extract()
    item['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
    item['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
    item['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

    final_list.append(item)

    updated_list = []

    for item in final_list:
        for i in range(len(item['ship'])):
            sub_item = {}
            sub_item['entry'] = {}
            sub_item['entry']['ship'] = [item['ship'][i]]
            sub_item['entry']['Itinerary'] = [item['Itinerary'][i]]
            sub_item['entry']['Price'] = [item['Price'][i]]
            sub_item['entry']['PerNight'] = [item['PerNight'][i]]
            updated_list.append(sub_item)

            print(sub_item)

    return updated_list
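
The index-based loop above raises `IndexError` if one of the scraped lists is shorter than `item['ship']`. A possible variant, sketched here as a plain function so it runs without Scrapy, walks the parallel lists with `zip` instead. The name `split_parallel_lists` and the sample data are hypothetical, not part of the original answer:

```python
def split_parallel_lists(fields):
    """Turn {'ship': [a, b], 'Price': [x, y]} into one dict per position.

    `fields` maps a field name to the list returned by .extract().
    """
    keys = list(fields)
    rows = []
    # zip stops at the shortest list, so a missing value cannot raise
    # IndexError; it drops the unmatched tail instead.
    for values in zip(*(fields[key] for key in keys)):
        rows.append({"entry": {key: [value] for key, value in zip(keys, values)}})
    return rows


# Made-up sample standing in for the scraped lists.
scraped = {
    "ship": ["Ship A", "Ship B"],
    "Itinerary": ["3 Night Bahamas ", "4 Night Bahamas "],
    "Price": ["$169", "$79"],
    "PerNight": ["$56/night", "$86/night"],
}

entries = split_parallel_lists(scraped)
for entry in entries:
    print(entry)
```

Because `zip` truncates silently, it is worth comparing the list lengths first if a dropped cruise would matter.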

Try this script to reformat the data. The reformatted data ends up in updated_list.

cruise_list = [

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas ",
            "4 Night Western Caribbean ",
            "4 Night Bahamas ",
            "3 Night Bahamas ",
            "5 Night Western Caribbean ",
            "5 Night Eastern Caribbean ",
            "7 Night Western Caribbean ",
            "7 Night Southern Caribbean ",
            "6 Night Western Caribbean ",
            "7 Night Western Caribbean ",
            "8 Night Eastern Caribbean "
        ],
        "Price": [
            "$169",
            "$179",
            "$289",
            "$349",
            "$359",
            "$389",
            "$389",
            "$409",
            "$424",
            "$524",
            "$939"
        ],
        "PerNight": [
            "$56/night",
            "$45/night",
            "$72/night",
            "$116/night",
            "$72/night",
            "$78/night",
            "$56/night",
            "$58/night",
            "$71/night",
            "$75/night",
            "$117/night"
        ]
    }
}
]

updated_list = []

for cruise_obj in cruise_list:
    cruise_data = cruise_obj['Cruise']
    for i in range(len(cruise_data['Itinerary'])):
        sub_item = {}
        sub_item['Cruise'] = {}
        sub_item['Cruise']['Cruiseline'] = cruise_data['Cruiseline']
        sub_item['Cruise']['Itinerary'] = [cruise_data['Itinerary'][i]]
        sub_item['Cruise']['Price'] = [cruise_data['Price'][i]]
        sub_item['Cruise']['PerNight'] = [cruise_data['PerNight'][i]]
        updated_list.append(sub_item)

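A few other thoughts:

  • If the JSON only ever stores cruise objects, the top-level Cruise key is somewhat redundant.

  • In many places you are storing values in arrays that do not need to be arrays. I am guessing this comes from the scraping (.extract() returns a list), but you should try modifying my script to drop the arrays around single values. E.g. a cruise object should not have more than one Cruiseline. Let me know if you need help.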
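
The two suggestions above could be sketched as follows: drop the outer Cruise key and store plain strings instead of one-element arrays. The function name `flatten_cruises` is made up, the sample data is shortened from the script above, and the `strip()` call on the itinerary is an extra assumption to remove the trailing space:

```python
def flatten_cruises(cruise_list):
    """Return one flat dict per sailing, with scalar values instead of arrays."""
    flat = []
    for cruise_obj in cruise_list:
        data = cruise_obj["Cruise"]
        # Assumes each object carries exactly one cruise line.
        cruiseline = data["Cruiseline"][0]
        for itinerary, price, per_night in zip(
            data["Itinerary"], data["Price"], data["PerNight"]
        ):
            flat.append(
                {
                    "Cruiseline": cruiseline,
                    "Itinerary": itinerary.strip(),
                    "Price": price,
                    "PerNight": per_night,
                }
            )
    return flat


# Shortened sample in the same shape as cruise_list above.
sample = [
    {
        "Cruise": {
            "Cruiseline": ["Ship Name"],
            "Itinerary": ["3 Night Bahamas ", "4 Night Western Caribbean "],
            "Price": ["$169", "$179"],
            "PerNight": ["$56/night", "$45/night"],
        }
    }
]

flat_cruises = flatten_cruises(sample)
for cruise in flat_cruises:
    print(cruise)
```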
