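I'm very new to Beautiful Soup, so I accept that I may be doing something very silly. Even so, after reading the documentation and following four different online tutorials, I haven't had the success I expected. First, let me explain the use case.
The goal is to run a search on a holiday-rental website (Stayz in this example) with a fixed set of criteria but varying dates, so that I can work out when I can get the best value for a holiday. I want to store all of the returned results in a database for further analysis.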
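To make the "same criteria, different dates" part concrete, here is a minimal sketch of how the search URLs could be generated. The URL pattern is simply copied from the one URL used in this question; the helper names `build_search_url` and `weekend_urls` are hypothetical, and I am assuming the site accepts other dates in the same positions.

```python
from datetime import date, timedelta

# Pattern copied from the single URL in this question; only the two dates vary.
URL_TEMPLATE = (
    "https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia"
    "/arrival:{arrival}/departure:{departure}/minBedrooms/3?petIncluded=false"
)


def build_search_url(arrival: date, nights: int) -> str:
    """Hypothetical helper: fill arrival/departure dates into the pattern."""
    departure = arrival + timedelta(days=nights)
    return URL_TEMPLATE.format(
        arrival=arrival.isoformat(), departure=departure.isoformat()
    )


def weekend_urls(first_arrival: date, weeks: int, nights: int = 2) -> list:
    """Hypothetical helper: one search URL per week, starting at first_arrival."""
    return [
        build_search_url(first_arrival + timedelta(weeks=i), nights)
        for i in range(weeks)
    ]


urls = weekend_urls(date(2020, 10, 23), weeks=3)
for u in urls:
    print(u)
```

Each URL can then be fetched and parsed in turn.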
But the first step is being able to capture the results. Here is my code:
# beautiful soup libraries
from bs4 import BeautifulSoup
import urllib.request
# Define & request the url that we want to scrape
url = r"https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia/arrival:2020-10-23/departure:2020-10-25/minBedrooms/3?petIncluded=false"
html_content = urllib.request.urlopen(url)
# Pass the html_content(the webpage) through our beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
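So far so good: this returns a copy of the expected page... I think! Now I want to find a specific part of the page, namely the listing for each search result.
Here is the relevant part of the HTML: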
<div class="media-flex__body">
<h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi</h2>
<span class="sr-only">Property 1: Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi</span>
<div class="HitInfo__details">
<div class="Details__propertyType Details__label" aria-hidden="true">House</div>
<div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div>
<div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div>
<div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div>
<div class="Details__label" aria-hidden="true">5 m<sup>2</sup></div>
<div class="sr-only"><span>Property TypeHouse</span><span>4Bedrooms</span><span>2Bathrooms</span><span>9Sleeps</span><span>5Square Meters</span></div>
</div>
<div class="GeoDistance">
<svg xmlns="http://www.w3.org/2000/svg" class="GeoDistance__icon" width="16" height="16" viewBox="0 0 16 16">
<g fill="none" fill-rule="evenodd" stroke="#5E6D77" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5">
<path class="GeoDistance__iconPinPath fill-transparent stroke-currentColor" d="M3.95 9.113a5.11 5.11 0 0 1 .546-6.579l.038-.038a5.11 5.11 0 0 1 7.226 0l.037.038a5.11 5.11 0 0 1 .548 6.58L8.147 15 3.95 9.113z"></path>
<path class="GeoDistance__iconPinHole fill-transparent stroke-currentColor" d="M9.84 6.146a1.692 1.692 0 1 1-3.387 0 1.694 1.694 0 0 1 3.387 0z"></path>
</g>
</svg>
<span class="GeoDistance__text">12 min. walk to the beach</span>
</div>
</div>
So, based on what I have read, I should be able to run the following search, which should give me what I need:
initial_search = soup.find_all('div', class_="media-flex__body")
However, I get no results back.
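For reference, the selector itself does match when that markup is actually present in the parsed document. A minimal sketch on a trimmed copy of the HTML above (`SAMPLE_HTML` is a local fixture, not the live page, and the dictionary layout is just an illustration):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown above, used as a local fixture.
SAMPLE_HTML = """
<div class="media-flex__body">
  <h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</h2>
  <div class="HitInfo__details">
    <div class="Details__propertyType Details__label" aria-hidden="true">House</div>
    <div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div>
    <div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div>
    <div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div>
  </div>
  <div class="GeoDistance"><span class="GeoDistance__text">12 min. walk to the beach</span></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

listings = []
for body in soup.find_all("div", class_="media-flex__body"):
    headline = body.find("h2", class_="HitInfo__headline")
    listings.append({
        "headline": headline.get_text(strip=True) if headline else None,
        # class_ matches any single class of an element, so this finds every label
        "details": [
            d.get_text(strip=True)
            for d in body.find_all("div", class_="Details__label")
        ],
    })

print(listings)
```

So an empty result on the live page suggests that the downloaded HTML does not contain these elements, rather than that the search is written incorrectly.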
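I also tried searching further up the tree, starting from class="HitCollection", which, if I understand correctly, should return all of the results. That does return one result, but it looks like a placeholder rather than the actual results.
This makes me wonder whether scraping search results needs a different approach from the one used for scraping a static page.
Below is the output of my second search. I'm not very experienced with web design, so perhaps this is obvious to those of you who are. I would really appreciate any help.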
<div class="HitCollection HitCollection--placeholder">
<div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder">
<div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div>
<div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder">
<div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
</div>
<div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder">
<div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
</div>
</div>
<div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder">
<div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div>
<div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder">
<div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
</div>
<div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder">
<div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
</div>
</div>
<div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder">
<div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div>
<div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder">
<div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
</div>
<div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder">
<div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
<div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div>
</div>
</div>
</div>
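Output like this can be detected programmatically before trying to extract anything. A rough sketch, assuming that the class names `HitCollection--placeholder` and `HitInfo__headline` (both taken from the markup in this question) reliably distinguish an unrendered page from a rendered one; `looks_unrendered` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Shortened version of the placeholder markup above (fixture, not live data).
PLACEHOLDER_HTML = """
<div class="HitCollection HitCollection--placeholder">
  <div aria-busy="true" class="Hit media-flex" data-wdio="hit-placeholder">
    <div class="LoadingPlaceholder Hit__headlinePlaceholder"></div>
  </div>
</div>
"""


def looks_unrendered(html: str) -> bool:
    """Heuristic: placeholder container present and no real headline found."""
    soup = BeautifulSoup(html, "html.parser")
    has_placeholder = soup.find("div", class_="HitCollection--placeholder") is not None
    has_results = soup.find("h2", class_="HitInfo__headline") is not None
    return has_placeholder and not has_results


print(looks_unrendered(PLACEHOLDER_HTML))
```

If this check is true for the HTML returned by urllib, the results are being filled in later by JavaScript, which a plain HTTP request does not run.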
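This should help you. The page is loaded dynamically, so I suggest using selenium instead of urllib to get the HTML, and changing the parser from html.parser to lxml. With the rendered HTML you can search for the result blocks, or, if you want to access the h2 tag directly, search for the h2 elements themselves; taking the first match gives the first title on the page.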
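A minimal sketch of that approach, assuming selenium 4 with a local Chrome install, and assuming that a rendered result always contains an h2 with class `HitInfo__headline` (taken from the HTML in the question). The function names `fetch_rendered_html`, `first_headline` and `main` are hypothetical, and the code originally posted with this answer may have differed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def fetch_rendered_html(url: str, timeout: int = 20) -> str:
    """Load the page in headless Chrome and return the HTML once results appear."""
    # Imported here so that first_headline() can be used without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: a real result is present once a headline h2 exists.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h2.HitInfo__headline"))
        )
        return driver.page_source
    finally:
        driver.quit()


def first_headline(html: str):
    """Return the text of the first result headline, or None if there is none."""
    try:
        soup = BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is an optional dependency; fall back to the built-in parser.
        soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find("h2", class_="HitInfo__headline")
    return h2.get_text(strip=True) if h2 else None


def main() -> None:
    url = (
        "https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia"
        "/arrival:2020-10-23/departure:2020-10-25/minBedrooms/3?petIncluded=false"
    )
    print(first_headline(fetch_rendered_html(url)))
```

Calling `main()` needs selenium and a browser; `first_headline` on its own only needs the HTML string, so it can be tried against saved page source first.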