Python, bs4: tags shown in the browser inspector are missing when parsing

Posted 2024-10-03 02:46:14


I have run into an unexpected problem. I am using Python 3.5 and BeautifulSoup, and I want to parse the following page:

import requests
import bs4

url = 'https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s'
res = requests.get(url)
res.raise_for_status()
DicoSoup = bs4.BeautifulSoup(res.text, "lxml")

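I am interested in retrieving the links to the pictures in the listing. When I inspect the site's HTML, I see that they sit inside a div tag with class "thumbnails", within span tags with class "item_imagePic", as img tags.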

However, when I select the div tag, the span tags are not there:

div = DicoSoup.select("div.thumbnails")

div
Out[54]: 
[<div class="thumbnails" data-alt="Talons aiguilles Stéphane Kélian - 37.5">
 <ul>
 <li class="thumb selected trackable" data-info='{"event_name" :             "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_0"></li>
 <li class="thumb trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_1">          </li>
 <li class="thumb trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_2"></li>
 </ul>
 </div>]

When I inspect the HTML in the browser, this is what I see:

<div class="thumbnails" data-alt="Talons aiguilles Stéphane Kélian - 37.5" style="width: 596px;">
                        <ul style="">

                                <li id="thumb_0" class="thumb selected trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

                                <li id="thumb_1" class="thumb trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

                                <li id="thumb_2" class="thumb trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

                        </ul>
                    </div>

How is this possible? What do I need to do to select them?

I have tried:

div = DicoSoup.select_one("div.thumbnails span.item_imagePic")
div = DicoSoup.select_one("div.thumbnails ul li span.item_imagePic")
div = DicoSoup.select("div.thumbnails ul li span.item_imagePic")
span = DicoSoup.find('span', {'class': 'item_imagePic'})
span = DicoSoup.find('span',id="thumb_0")
div = DicoSoup.select("div.thumbnails img")
div = DicoSoup.select("div.thumbnails span img")
div = DicoSoup.select("div.thumbnails ul li span.item_imagePic img")

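None of them finds anything: the select_one and find calls return None (a "NoneType" object), and the select calls return an empty list.

Thank you.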


1 answer
User
Reply #1 · posted 2024-10-03 02:46:14

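As I commented, the thumbnails are generated dynamically with JavaScript, so they are not in the HTML that requests downloads. You can, however, grab the script that follows the div and parse the paths out of it: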

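import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s").content, "lxml")

# The first <script> after the thumbnails div defines the image arrays.
script = soup.select_one("div.thumbnails").find_next("script")
print(script.text.strip())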

This gives you:

var images = new Array(), images_thumbs = new Array();
                        images_thumbs[0] = "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg"; 
              images[0] = "//img0.leboncoin.fr/images/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";

                        images_thumbs[1] = "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg"; 
              images[1] = "//img1.leboncoin.fr/images/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";

                        images_thumbs[2] = "//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg"; 
              images[2] = "//img2.leboncoin.fr/images/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg";
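The output above also explains why the selectors in the question found nothing: the image URLs exist only inside JavaScript, and the span and img tags are created in the browser afterwards. A minimal sketch of how one might confirm that on a downloaded page, using only the standard library. The helper names `ClassCounter` and `count_class` are hypothetical, and `static_html` is a shortened stand-in for what requests receives, not the real response:

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class (hypothetical helper)."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.count += 1


def count_class(html, wanted_class):
    """Return how many tags in html have wanted_class among their classes."""
    parser = ClassCounter(wanted_class)
    parser.feed(html)
    return parser.count


# Shortened stand-in for the static HTML: the <li> elements are empty.
static_html = """
<div class="thumbnails">
 <ul>
  <li class="thumb selected trackable" id="thumb_0"></li>
  <li class="thumb trackable" id="thumb_1"></li>
 </ul>
</div>
"""

print(count_class(static_html, "thumb"))          # the <li> elements are present
print(count_class(static_html, "item_imagePic"))  # the <span> elements are not
```

If the second count is zero for `res.text` as well, the tags seen in the inspector were added by a script, and no CSS selector run on the downloaded HTML will find them.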

To get the image links:

import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s").content, "lxml")

script = soup.select_one("div.thumbnails").find_next("script").text

# Raw string so the backslashes reach the regex engine unchanged.
print(re.findall(r'images_thumbs\[\d+\]\s*=\s*"(.*?)";', script))

Or simply split the lines and strip the quotes, semicolon and spaces:

print([s.split("=", 1)[1].strip('"; ') for s in script.splitlines() if s.strip().startswith("images_thumbs")])

Both give you:

[u'//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg', u'//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg', u'//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg']
[u'//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg', u'//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg', u'//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg']
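The same script also defines an `images` array with the full-size pictures. As a rough sketch, both arrays could be collected in one pass. The function name `parse_image_arrays` is hypothetical, and `script_text` is a shortened copy of the script output shown earlier so the example runs without a network request:

```python
import re

# Shortened copy of the script text printed in the answer.
script_text = """
var images = new Array(), images_thumbs = new Array();
    images_thumbs[0] = "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";
    images[0] = "//img0.leboncoin.fr/images/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";

    images_thumbs[1] = "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";
    images[1] = "//img1.leboncoin.fr/images/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";
"""

# Matches lines such as: images_thumbs[0] = "//host/path.jpg";
ASSIGNMENT = re.compile(r'(images(?:_thumbs)?)\[(\d+)\]\s*=\s*"(.*?)";')


def parse_image_arrays(text):
    """Return {"images": [...], "images_thumbs": [...]} ordered by array index."""
    found = {"images": {}, "images_thumbs": {}}
    for name, index, url in ASSIGNMENT.findall(text):
        found[name][int(index)] = url
    return {name: [urls[i] for i in sorted(urls)] for name, urls in found.items()}


arrays = parse_image_arrays(script_text)
print(arrays["images_thumbs"])
print(arrays["images"])
```

This assumes the page keeps the `name[index] = "url";` layout shown above; if the site changes its script, the pattern would need adjusting.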

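Finally, you just need to prepend a scheme, i.e. https. The paths already start with "//", so only "https:" has to be added:

print(["https:" + path for path in re.findall(r'images_thumbs\[\d+\]\s*=\s*"(.*?)";', script)])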
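Paths that begin with "//" are protocol-relative URLs, so another option is to let the standard library resolve them against the page URL instead of concatenating strings. A small sketch with `urllib.parse.urljoin`; the names `PAGE_URL`, `paths` and `full_urls` are only illustrative, and `paths` holds two of the results shown above:

```python
from urllib.parse import urljoin

# The page the paths were scraped from; its scheme is reused for each path.
PAGE_URL = "https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s"

# Protocol-relative paths as returned by the regex in the answer.
paths = [
    "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg",
    "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg",
]

# urljoin keeps the host from each path and takes only the scheme from PAGE_URL.
full_urls = [urljoin(PAGE_URL, path) for path in paths]

for full_url in full_urls:
    print(full_url)
```

The benefit of this approach is that it also copes with ordinary relative or already absolute links, should the page ever mix them.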
