Why can't I get BeautifulSoup to work as described?

Posted 2024-06-27 09:32:24


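I am new to Beautiful Soup, so I accept that I may be doing something very silly. Even so, after reading the documentation and following four different online tutorials, I have not had the success I expected. First, let me explain the use case.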

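The goal is to run a search on a holiday home website, such as the one in this example, with a fixed set of criteria but varying dates, so that I can work out when I can get the best holiday value. I want to store all the returned results in a database for further analysis.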
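The "same criteria, different dates" idea can be sketched as a small URL builder. This is only an illustration: it assumes the site keeps accepting the URL pattern used in this question, and `build_search_url` and `friday_arrivals` are hypothetical helper names, not part of any library.

```python
from datetime import date, timedelta

# URL pattern copied from the search URL in this question; only the dates vary.
SEARCH_URL = (
    "https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia"
    "/arrival:{arrival}/departure:{departure}/minBedrooms/3?petIncluded=false"
)


def build_search_url(arrival: date, nights: int = 2) -> str:
    """Return the search URL for a stay of `nights` nights starting on `arrival`."""
    departure = arrival + timedelta(days=nights)
    return SEARCH_URL.format(
        arrival=arrival.isoformat(), departure=departure.isoformat()
    )


def friday_arrivals(first: date, count: int) -> list:
    """Return `count` consecutive Fridays on or after `first`."""
    # date.weekday() numbers Monday as 0, so Friday is 4.
    first_friday = first + timedelta(days=(4 - first.weekday()) % 7)
    return [first_friday + timedelta(weeks=i) for i in range(count)]


urls = [build_search_url(d) for d in friday_arrivals(date(2020, 10, 23), 3)]
for u in urls:
    print(u)
```

Each generated URL can then be fetched and parsed in turn, with the arrival date stored alongside the results.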

But the first step is being able to capture the results. Here is my code:

# Beautiful Soup and urllib imports
from bs4 import BeautifulSoup
import urllib.request

# Define & request the url that we want to scrape
url = r"https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia/arrival:2020-10-23/departure:2020-10-25/minBedrooms/3?petIncluded=false"
html_content = urllib.request.urlopen(url)

# Pass the html_content(the webpage) through our beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')

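So far so good: this returns a copy of the expected page... I think! Now I want to find a specific part of the page (the original post showed a screenshot of it here).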

Here is the relevant part of the HTML:

<div class="media-flex__body">
  <h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</h2>
  <span class="sr-only">Property 1: Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</span>
  <div class="HitInfo__details">
    <div class="Details__propertyType Details__label" aria-hidden="true">House</div>
    <div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div>
    <div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div>
    <div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div>
    <div class="Details__label" aria-hidden="true">5 m<sup>2</sup></div>
    <div class="sr-only"><span>Property TypeHouse</span><span>4Bedrooms</span><span>2Bathrooms</span><span>9Sleeps</span><span>5Square Meters</span></div>
  </div>
  <div class="GeoDistance">
    <svg xmlns="http://www.w3.org/2000/svg" class="GeoDistance__icon" width="16" height="16" viewBox="0 0 16 16">
      <g fill="none" fill-rule="evenodd" stroke="#5E6D77" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5">
        <path class="GeoDistance__iconPinPath fill-transparent stroke-currentColor" d="M3.95 9.113a5.11 5.11 0 0 1 .546-6.579l.038-.038a5.11 5.11 0 0 1 7.226 0l.037.038a5.11 5.11 0 0 1 .548 6.58L8.147 15 3.95 9.113z"></path>
        <path class="GeoDistance__iconPinHole fill-transparent stroke-currentColor" d="M9.84 6.146a1.692 1.692 0 1 1-3.387 0 1.694 1.694 0 0 1 3.387 0z"></path>
      </g>
    </svg>
    <span class="GeoDistance__text">12 min. walk to the beach</span>
  </div>
</div>

So, based on what I have read, I should be able to run the following search, which should give me what I need:

initial_search = soup.find_all('div', class_="media-flex__body")

However, I get no results back.

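I also tried searching further up the tree, starting from class="HitCollection", which, if I understand correctly, should return all of the results. This does return a result, but it looks like a placeholder rather than the actual results.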

This makes me wonder whether scraping search results requires a different approach from scraping a static page.

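Below is the result of my second search. I am not very experienced with web design, so maybe this is obvious to those of you who are. I would greatly appreciate any help.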

<div class="HitCollection HitCollection--placeholder">
  <div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder">
    <div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div>
    <div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder">
      <div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
    </div>
    <div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder">
      <div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
    </div>
  </div>
  <div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder">
    <div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div>
    <div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder">
      <div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
    </div>
    <div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder">
      <div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
    </div>
  </div>
  <div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder">
    <div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div>
    <div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder">
      <div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
    </div>
    <div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder">
      <div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
      <div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
    </div>
  </div>
</div>
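Placeholder markup like this usually means the server sent a loading skeleton and the real listings are filled in later by JavaScript, which urllib never runs. A minimal sketch of a check for that situation follows; `looks_unrendered` is a hypothetical helper, and the attribute and class names are taken only from the HTML shown in this question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the placeholder markup shown above.
placeholder_html = """
<div class="HitCollection HitCollection--placeholder">
  <div aria-busy="true" class="Hit media-flex" data-wdio="hit-placeholder">
    <div class="LoadingPlaceholder Hit__headlinePlaceholder"></div>
  </div>
</div>
"""


def looks_unrendered(html: str) -> bool:
    """Heuristic: True if loading skeletons are present but no real listing is."""
    soup = BeautifulSoup(html, "html.parser")
    has_placeholder = soup.select_one('[data-wdio="hit-placeholder"]') is not None
    has_listing = soup.find("div", class_="media-flex__body") is not None
    return has_placeholder and not has_listing


print(looks_unrendered(placeholder_html))
```

If a check like this returns True for the downloaded page, the parser is working correctly and the missing content has to be obtained another way, for example by rendering the page in a browser.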

1 answer
User
#1 · posted 2024-06-27 09:32:24

This should help. Parsing the HTML fragment from your question directly works as expected:

from bs4 import BeautifulSoup

html = '<div class="media-flex__body"><h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</h2><span class="sr-only">Property 1: Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</span><div class="HitInfo__details"><div class="Details__propertyType Details__label" aria-hidden="true">House</div><div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div><div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div><div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div><div class="Details__label" aria-hidden="true">5 m<sup>2</sup></div><div class="sr-only"><span>Property TypeHouse</span><span>4Bedrooms</span><span>2Bathrooms</span><span>9Sleeps</span><span>5Square Meters</span></div></div><div class="GeoDistance"><svg xmlns="http://www.w3.org/2000/svg" class="GeoDistance__icon" width="16" height="16" viewBox="0 0 16 16"><g fill="none" fill-rule="evenodd" stroke="#5E6D77" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5"><path class="GeoDistance__iconPinPath fill-transparent stroke-currentColor" d="M3.95 9.113a5.11 5.11 0 0 1 .546-6.579l.038-.038a5.11 5.11 0 0 1 7.226 0l.037.038a5.11 5.11 0 0 1 .548 6.58L8.147 15 3.95 9.113z"></path><path class="GeoDistance__iconPinHole fill-transparent stroke-currentColor" d="M9.84 6.146a1.692 1.692 0 1 1-3.387 0 1.694 1.694 0 0 1 3.387 0z"></path></g></svg><span class="GeoDistance__text">12 min. walk to the beach</span></div></div>'

soup = BeautifulSoup(html,'html5lib')

div = soup.find('div',class_ = "media-flex__body")

print(div.h2.text)

Output:

Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi

If you want to access the h2 tag directly, use the following:

h2 = soup.find('h2',class_ = "HitInfo__headline hover-text")

print(h2.text)

Output:

Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi
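One caveat with passing two class names in a single class_ string: Beautiful Soup then compares against the exact attribute value, so the match depends on the order of the classes. A CSS selector avoids that. The sketch below uses a trimmed copy of the markup from the question and the built-in html.parser; the dictionary layout is just an illustrative choice.

```python
from bs4 import BeautifulSoup

# Trimmed version of the listing markup from the question.
html = """
<div class="media-flex__body">
  <h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House</h2>
  <div class="HitInfo__details">
    <div class="Details__propertyType Details__label">House</div>
    <div class="Details__bedrooms Details__label">4 BR</div>
    <div class="Details__bathrooms Details__label">2 BA</div>
    <div class="Details__sleeps Details__label">Sleeps 9</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

listings = []
# select() matches on individual classes, regardless of their order.
for body in soup.select("div.media-flex__body"):
    headline = body.select_one("h2.HitInfo__headline")
    labels = body.select(".HitInfo__details .Details__label")
    listings.append({
        "title": headline.get_text(strip=True) if headline else None,
        "details": [label.get_text(strip=True) for label in labels],
    })

print(listings)
# [{'title': 'Merri Beach House', 'details': ['House', '4 BR', '2 BA', 'Sleeps 9']}]
```

Looping over every `div.media-flex__body` also collects all listings on a page rather than only the first one.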

Another thing I would suggest is to use selenium instead of urllib to fetch the HTML, because the page is loaded dynamically. Like this:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
html_content = driver.page_source

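Also change the parser from html.parser to lxml (this needs the lxml package installed). So here is the final code to extract the first headline on the page: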

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Define the URL that we want to scrape
url = r"https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia/arrival:2020-10-23/departure:2020-10-25/minBedrooms/3?petIncluded=false"

# Load the page in a real browser so that its JavaScript runs
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)  # crude fixed wait for the listings to render
html_content = driver.page_source
driver.quit()  # end the browser session

soup = BeautifulSoup(html_content, 'lxml')

# The first listing on the page
div = soup.find('div', class_="media-flex__body")

print(div.h2.text)

Output:

Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi
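A fixed three-second sleep can be too short on a slow connection and wasteful on a fast one. A possible refinement, sketched here under the assumption that the listings still use the `media-flex__body` class from the question, is selenium's explicit wait. `fetch_rendered_html` and `first_headline` are hypothetical helper names.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url: str, timeout: float = 15.0) -> str:
    """Load `url` in Chrome and return the page source once a listing is present."""
    # Imported here so that first_headline() below can be used on machines
    # that have no selenium or browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one listing body exists in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.media-flex__body"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down, even on timeout


def first_headline(html: str):
    """Return the text of the first listing headline, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.select_one("div.media-flex__body h2")
    return h2.get_text(strip=True) if h2 else None


# Usage (needs Chrome and the selenium package):
# html_content = fetch_rendered_html(url)
# print(first_headline(html_content))
```

Returning None instead of raising when no headline is found makes it easier to tell an empty result page from a crash when many dates are scraped in a loop.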
