An XPath returns an empty value, which misaligns the lists

import scrapy
from autotrader.items import AutotraderItem


class AutotraderSpider(scrapy.Spider):
    name = "autotrader"
    allowed_domains = ["autotrader.co.uk"]
    start_urls = ["https://www.autotrader.co.uk/car-dealers/search?advertising-location=at_cars&postcode=m43aq&radius=1500&forSale=on&toOrder=on&sort=with-retailer-reviews&page=822"]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="dealerList__container"]'):
            names = sel.xpath('.//*[@itemprop="legalName"]/text() ').extract()
            names = [name.strip() for name in names]
            addresses = sel.xpath('.//li/article/a/div/p[@itemprop="address"]/text()').extract()
            addresses = [address.strip() for address in addresses]
            carss = sel.xpath('.//li/article/a/div/p[@class="dealerList__itemCount"]/span/text()').extract()
            carss = [cars.strip() for cars in carss]
            result = zip(names, addresses, carss)
            for name, address, cars in result:
                item = AutotraderItem()
                item['name'] = name
                item['address'] = address
                item['cars'] = cars
                yield item

2 answers

User

Answer 1 · edited 2024-09-24 06:27:08

Try this. You can use XPath expressions in your Scrapy project like so:

import scrapy


class AutotraderSpider(scrapy.Spider):
    name = "autotrader"
    allowed_domains = ["autotrader.co.uk"]

    start_urls = ["https://www.autotrader.co.uk/car-dealers/search?advertising-location=at_cars&postcode=m43aq&radius=1500&forSale=on&toOrder=on&sort=with-retailer-reviews&page=822"]

    def parse(self, response):
        for items in response.xpath("//article[@class='dealerList__item']"):
            name = items.xpath(".//span[@itemprop='legalName']/text()").extract_first()
            address = ' '.join([' '.join(item.split()) for item in items.xpath(".//p[@class='dealerList__itemAddress']/text()").extract()])
            cars = items.xpath(".//span[@class='dealerList__itemCountNumber']/text()").extract_first()
            yield {"Name":name,"Address":address,"Cars":cars}

Partial output:

Midland Motors Leicester Street, Burton-On-Trent, Staffordshire DE14 3BA 2
Ns Cars 69 Eldon Street, Burton-On-Trent, Staffordshire DE15 0LT 1
RS Sales Nottingham Ltd Unit 1 TRINITY PARK, RANDALL PARK WAY, Retford, Nottinghamshire DN22 7WF 1
Adc Ltd Unit 3 HUCKNALL LANE, Nottingham, Nottinghamshire NG6 8AJ 5
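
The address expression in the spider above is dense, so here is a minimal sketch of the same whitespace handling in isolation. The helper name `join_address` and the sample fragments are made up for illustration; the real text nodes on the page may be split differently.

```python
def join_address(fragments):
    # Collapse runs of whitespace inside each fragment, then join the
    # cleaned fragments with single spaces (same idea as the spider's
    # nested ' '.join(...) expression).
    return " ".join(" ".join(fragment.split()) for fragment in fragments)


# Hypothetical text nodes, as an address paragraph broken over several lines.
fragments = [
    "\n  Leicester Street,\n ",
    " Burton-On-Trent,  Staffordshire ",
    "DE14 3BA\n",
]
address = join_address(fragments)
print(address)
```

Because the XPath is evaluated relative to a single `article`, all fragments joined this way belong to the same dealer.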

User

Answer 2 · edited 2024-09-24 06:27:08

Your selector loop is slightly off.

Here you loop over the unordered list, and there is only one of those per page:

for sel in response.xpath('//ul[@class="dealerList__container"]'):
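
Since that single `ul` contains every dealer, each XPath query inside the loop returns one page-wide list, and `zip` then pairs the values purely by position. The plain-Python sketch below, with made-up dealer data, shows what happens when one dealer has no address; this is presumably the misalignment described in the question.

```python
# Made-up data: three dealers, but the second one has no address on the
# page, so the page-wide address query returns only two strings.
names = ["Dealer A", "Dealer B", "Dealer C"]
addresses = ["1 High Street", "3 Low Road"]  # Dealer B's address is missing
cars = ["2", "1", "5"]

rows = list(zip(names, addresses, cars))
for row in rows:
    print(row)

# zip pairs values by position and stops at the shortest input, so
# Dealer B is given Dealer C's address and Dealer C is dropped entirely.
```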

What you want to do is iterate over all the list items instead:

for sel in response.xpath('//li[@class="dealerList__itemContainer"]'):

If you loop this way, you get the name, address, and car count of each list item individually:

for sel in response.xpath('//li[@class="dealerList__itemContainer"]'):
    # Every query is relative to one list item, so a missing field becomes
    # an empty string for that dealer instead of shifting the other rows.
    name = sel.xpath('.//*[@itemprop="legalName"]/text()').extract_first(default='').strip()
    address_parts = sel.xpath('.//article/a/div/p[@itemprop="address"]/text()').extract()
    address = ' '.join(part.strip() for part in address_parts if part.strip())
    cars = sel.xpath('.//article/a/div/p[@class="dealerList__itemCount"]/span/text()').extract_first(default='').strip()
    item = AutotraderItem()
    item['name'] = name
    item['address'] = address
    item['cars'] = cars
    yield item
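
To see the effect of per-item iteration without running a crawl, here is a rough standalone sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors (its XPath support is limited but sufficient here), and the markup and the `parse_dealers` helper are invented for the example, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup; the second dealer has no address element.
PAGE = """
<ul class="dealerList__container">
  <li class="dealerList__itemContainer">
    <span itemprop="legalName"> Dealer A </span>
    <p itemprop="address"> 1 High Street </p>
  </li>
  <li class="dealerList__itemContainer">
    <span itemprop="legalName"> Dealer B </span>
  </li>
  <li class="dealerList__itemContainer">
    <span itemprop="legalName"> Dealer C </span>
    <p itemprop="address"> 3 Low Road </p>
  </li>
</ul>
"""


def parse_dealers(markup):
    root = ET.fromstring(markup)
    dealers = []
    # One iteration per list item: every lookup is scoped to that item.
    for li in root.findall(".//li[@class='dealerList__itemContainer']"):
        name = li.findtext(".//*[@itemprop='legalName']", default="").strip()
        # A missing element yields the default instead of shifting other rows.
        address = li.findtext(".//*[@itemprop='address']", default="").strip()
        dealers.append({"name": name, "address": address})
    return dealers


dealers = parse_dealers(PAGE)
for dealer in dealers:
    print(dealer)
```

Dealer B ends up with an empty address while Dealer C keeps its own, which is the behaviour the per-item loop is meant to give in the spider.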
