Scrapy regex on one site is probably not using normal encoding

1 answer

User

#1 · Posted 2024-10-06 10:35:06

\xa0 is just a non-breaking space.

For example, on this page, here is some of the HTML that contains the price value:

<div class="price-box price-final_price" data-role="priceBox" data-product-id="19815">
<span class="price-container price-final_price tax weee"
         itemprop="offers" itemscope itemtype="http://schema.org/Offer">
        <span  id="product-price-19815"                data-price-amount="7490"
        data-price-type="finalPrice"
        class="price-wrapper ">
        <span class="price">7 490,00 €</span>    </span>
                <meta itemprop="price" content="7490" />
        <meta itemprop="priceCurrency" content="EUR" />
    </span>
</div>

If you choose to get the price from <span class="price">7 490,00 €</span>, just replace '\xa0' with ' ' or with an empty string:

$ scrapy shell https://www.garrafeiranacional.com/catalog/product/view/id/19815/s/1945-petrus-tinto/category/361/
2017-07-21 10:20:42 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-07-21 10:20:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.garrafeiranacional.com/catalog/product/view/id/19815/s/1945-petrus-tinto/category/361/> (referer: None)

>>> response.css('span.price').get()
'<span class="price">7\xa0490,00\xa0€</span>'
>>> response.css('span.price::text').get()
'7\xa0490,00\xa0€'

>>> response.css('span.price::text').get().replace('\u00A0', '')
'7490,00€'
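If a numeric value is needed rather than a cleaned string, the replacement above can be taken one step further. The following is a minimal sketch, not part of the original answer: the helper name parse_price is hypothetical, and it assumes the site always uses a comma as the decimal separator, as in the text shown above.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a display price such as '7\xa0490,00\xa0€' into a Decimal.

    Hypothetical helper. Assumes a comma is the decimal separator.
    """
    # Keep only ASCII digits and the decimal comma. This drops the
    # non-breaking spaces (\xa0), ordinary spaces and the currency symbol.
    cleaned = "".join(ch for ch in text if ch in "0123456789,")
    # Decimal expects a dot as the decimal separator.
    return Decimal(cleaned.replace(",", "."))


print(parse_price("7\xa0490,00\xa0€"))  # 7490.00
```

Decimal is used here instead of float so that the amount is not subject to binary rounding.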

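Another option, which may be easier to consume in a program, is to use one of the other places on the page where the price appears. In the same HTML snippet above, you can see: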

    <meta itemprop="price" content="7490" />
    <meta itemprop="priceCurrency" content="EUR" />

It is also in the <head> section:

<meta property="product:price:amount" content="7490"/>
<meta property="product:price:currency" content="EUR"/>
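Reading those meta tags could look roughly like the sketch below. It uses only the Python standard library so that it runs without a live page. The class name MetaPriceParser and the SNIPPET constant are illustrative and not from the original answer. In a Scrapy shell, the equivalent lookup would presumably be response.css('meta[itemprop="price"]::attr(content)').get().

```python
from html.parser import HTMLParser


class MetaPriceParser(HTMLParser):
    """Collect the content of <meta> tags keyed by itemprop or property."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("itemprop") or attrs.get("property")
        if key and "content" in attrs:
            # Keep the first value seen for each key.
            self.values.setdefault(key, attrs["content"])


# Illustrative input: the two meta tags quoted in the answer.
SNIPPET = """
<meta itemprop="price" content="7490" />
<meta itemprop="priceCurrency" content="EUR" />
"""

parser = MetaPriceParser()
parser.feed(SNIPPET)

price = int(parser.values["price"])
currency = parser.values["priceCurrency"]
print(price, currency)  # 7490 EUR
```

Because the content attribute already holds a plain number, no non-breaking spaces or currency symbols need to be stripped.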
