Unable to scrape specific content from a website with BeautifulSoup4

Posted 2024-07-05 11:41:01


I am having trouble scraping this link with Python 3 and BeautifulSoup4:

http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining

I only want to get this part:

When you are in ...

Capitol City Grille
This downtown Lansing restaurant offers ...

Capitol City Grille Lounge
For a glass of wine or a ...

Room Service
If you prefer ...

I have this code:

for rest in dining_page_soup.select("div.copy_left p strong"):
    if rest.next_sibling is not None:
        if rest.next_sibling.next_sibling is not None:
            title = rest.text
            desc = rest.next_sibling.next_sibling
            print("Title:  " + title)
            print(desc)

But it gives me TypeError: 'NoneType' object is not callable

on the line desc = rest.next_sibling.next_sibling, even though I have an if statement that checks whether it is None.


2 Answers

If you don't mind using XPath, this should work:

import requests
from lxml import html

url = "http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining"
page = requests.get(url).text
tree = html.fromstring(page)

xp_t = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/text()"
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()[not(following-sibling::strong)]"

titles = tree.xpath(xp_t)
descriptions = tree.xpath(xp_d)  # still contains garbage like '\r\n'
descriptions = [d.strip() for d in descriptions if d.strip()]

for t, d in zip(titles, descriptions):
    print("{title}: {description}".format(title=t, description=d))

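Here descriptions contains 3 elements: "This downtown Lansing restaurant offers ...", "For a glass of wine or a ...", "If you prefer ...".

If you also need the "When you are in ..." paragraph, replace xp_d with: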

xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()"
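The zip above pairs titles and descriptions purely by position, so a section with no description text would shift every later pair. A minimal sketch of a per-element variant follows, which looks up each description relative to its own strong element. The markup in SAMPLE_HTML is a hypothetical layout assumed for illustration, not a copy of the real page, and title_description_pairs is a name made up for this example.

```python
from lxml import html

# Hypothetical markup: an assumed layout for illustration, not the real page.
SAMPLE_HTML = """
<div class="copy_left">
  <p>When you are in ...</p>
  <p><strong>Capitol City Grille</strong><br/>This downtown Lansing restaurant offers ...</p>
  <p><strong>Room Service</strong><br/>If you prefer ...</p>
  <p><strong>Menus</strong><br/><a href="#">Breakfast Menu</a></p>
</div>
"""


def title_description_pairs(markup):
    """Pair each <strong> title with the first non-empty text node after it."""
    tree = html.fromstring(markup)
    pairs = []
    # Same filter as the answer: skip <strong> elements followed by a link.
    xp = "//*[@class='copy_left']//strong[not(following-sibling::a)]"
    for strong in tree.xpath(xp):
        title = strong.text_content().strip()
        # Text nodes that are later siblings of this <strong>, minus whitespace.
        texts = [t.strip() for t in strong.xpath("following-sibling::text()") if t.strip()]
        pairs.append((title, texts[0] if texts else None))
    return pairs


pairs = title_description_pairs(SAMPLE_HTML)
for title, description in pairs:
    print("{title}: {description}".format(title=title, description=description))
```

Because the description is resolved per title, a title without any following text gets None instead of borrowing the next section's description.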

Here is a very simple solution:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining")
data = r.text
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
for found_text in soup.select('div.copy_left'):
    print(found_text.text)

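Update

Following a refinement of the question, here is a solution using re. The first paragraph, "When you ...", needs its own handling, because it does not follow the structure of the other paragraphs.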

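import re

# Matches every tag whose name starts with "strong".
for tag in soup.find_all(re.compile("^strong")):
    title = tag.text
    # Second sibling after the <strong> tag; assumes each title has at least two following siblings.
    desc = tag.next_sibling.next_sibling
    print("Title:  " + title)
    print(desc)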

Output

Title: Capitol City Grille

This downtown Lansing restaurant offers delicious, contemporary American cuisine in an upscale yet relaxed environment. You can enjoy dishes that range from fluffy pancakes to juicy filet mignon steaks. Breakfast and lunch buffets are available, as well as an à la carte menu.

Title: Capitol City Grille Lounge

For a glass of wine or a hand-crafted cocktail and great conversation, spend an afternoon or evening at Capitol City Grille Lounge with friends or colleagues.

Title: Room Service

If you prefer to dine in the comfort of your own room, order from the room service menu.

Title: Menus

Breakfast Menu

Title: Capitol City Grille Hours

Breakfast, 6:30-11 a.m.

Title: Capitol City Grille Lounge Hours

Mon-Thu, 11 a.m.-11 p.m.

Title: Room Service Hours

Daily, 6:30 a.m.-2 p.m. and 5-10 p.m.
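Chaining next_sibling.next_sibling relies on every title being followed by exactly one element and then the description. A minimal sketch of a more defensive variant is shown here: it walks the siblings after each strong tag until it finds non-empty text, and reads the introductory paragraph separately. SAMPLE_HTML is a hypothetical layout assumed for illustration rather than the real page, and first_text_after, extract_sections, and extract_intro are names invented for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: an assumed layout for illustration, not the real page.
SAMPLE_HTML = """
<div class="copy_left">
  <p>When you are in ...</p>
  <p><strong>Capitol City Grille</strong><br/>This downtown Lansing restaurant offers ...</p>
  <p><strong>Room Service</strong><br/>If you prefer ...</p>
  <p><strong>Menus</strong><br/><a href="#">Breakfast Menu</a></p>
</div>
"""


def first_text_after(tag):
    """Return the first non-empty text among the siblings after `tag`, or None."""
    for sibling in tag.next_siblings:
        if isinstance(sibling, NavigableString):
            text = str(sibling).strip()
        else:
            # A tag such as <br/> yields "", a tag such as <a> yields its text.
            text = sibling.get_text(strip=True)
        if text:
            return text
    return None


def extract_sections(markup):
    """Collect (title, description) pairs for every <strong> inside div.copy_left."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = []
    for strong in soup.select("div.copy_left p strong"):
        desc = first_text_after(strong)
        if desc is not None:
            sections.append((strong.get_text(strip=True), desc))
    return sections


def extract_intro(markup):
    """Return the text of the first paragraph that has no <strong> title, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    for p in soup.select("div.copy_left p"):
        if p.find("strong") is None:
            return p.get_text(strip=True)
    return None


sections = extract_sections(SAMPLE_HTML)
print(extract_intro(SAMPLE_HTML))
for title, desc in sections:
    print("Title: " + title)
    print(desc)
```

Iterating over next_siblings never dereferences a missing sibling, so a title at the end of its paragraph is skipped instead of raising an error.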
