How do I get a text node with BeautifulSoup?

Posted 2024-06-02 23:14:45


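I'm new to BS4 and web scraping, so I apologize in advance for such a basic question.

I'm scraping the BeerAdvocate site (https://www.beeradvocate.com/beer/?view=recent), and I can't figure out how to scrape the ABV content, mainly because I'm not sure which tag I should use. According to the browser's HTML inspector, the node is #text, but I'm not sure how to handle it.

Does anyone know how to extract this information?

Thanks, everyone.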

2 Answers

To get the ABV and the beer name, you can use the following example:

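import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.beeradvocate.com/beer/?view=recent'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Match text nodes that end in e.g. "6.8% ABV" and capture the number.
r = re.compile(r'([\d.]+)% ABV$')
# string= (called text= in older bs4 versions) searches text nodes, not tags.
for t in soup.find_all(string=r):
    name = t.find_previous('h6').text  # nearest <h6> before the text node
    amount = r.search(t).group(1)
    print('{:<50} {}%'.format(name, amount))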

Prints:

Granát (BrouCzech Dark)                            5%
HopTime Harvest Ale                                6%
Direct Current                                     6.8%
Hopzilla Double IPA                                8.7%
Dankful                                            7.4%
Cancun Commie                                      11.5%
Welcome Young One                                  8.2%
Lick the Spoon                                     12%
Split Open and Melt                                8%
Speedway Stout                                     12%
What Mask?                                         8.4%
Switch Lanes                                       7%
Hella Juice Bag                                    8.2%
Down By The River                                  4.9%
Road Town                                          7.5%
Manhattan Social Club                              12.5%
Flash Kick                                         8.2%
Naked Brunch                                       8.5%
Tiki Breeze                                        7%
Oberon - Mango                                     5.8%
Eldest Brother                                     11%
Bliss                                              8%
Watou Tripel                                       7.5%
Respect Your Elders                                7.25%
Braxton Labs Smoothie Sour: Tropical               4.8%
Heaven Scent                                       5.5%
Oktoberfest                                        6.5%
Phaser                                             6.5%
Mark It Zero!                                      12%
Lake George IPA                                    6.8%
Triangled IPA (⟁)                                  8%
Broo Doo                                           7%
Porter                                             6.5%
Imperial Porter - Rum Barrel Aged w/ Coconut       7.2%
Willow                                             7.1%
State of the Art - Orange DIPA                     8.7%
Fest-Beer                                          5.9%
Boskeun                                            10%
Smuttlabs Baja Hoodie                              8.4%
Trappist Achel 8° Bruin                            8%
Double Dry Hopped Double Mosaic Dream              8.5%
Falcon Smash                                       7.4%
Hazy Wonder                                        6%
Mango Wango                                        7.5%
North Park                                         5%
The Tomb                                           10.2%
Cashmere Hammer                                    6.5%
Chonk Sundae Sour (Peanut Butter and Jelly)        4.3%
The Tearing Of Flesh From Bone                     8.2%
Oktoberfest                                        6.1%
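
The `#text` entry shown by the HTML inspector is a text node rather than a tag, so there is no tag name to search for. BeautifulSoup exposes such nodes as `NavigableString` objects, and the `string=` argument of `find_all` (called `text=` in older versions) matches them directly. Below is a minimal offline sketch of the same approach. The HTML fragment and the helper name `extract_abv` are made up for illustration, and the real page layout of beeradvocate.com may differ:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
html = """
<div><h6><a href="#">Direct Current</a></h6>
<span>Some Brewery</span><br>IPA | 6.8% ABV</div>
<div><h6><a href="#">Speedway Stout</a></h6>
<span>Another Brewery</span><br>Stout | 12% ABV</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A whole or decimal number followed by "% ABV" at the end of the text node.
abv_pattern = re.compile(r"(\d+(?:\.\d+)?)% ABV\s*$")


def extract_abv(soup):
    """Return (beer name, ABV as float) pairs found in the parsed document."""
    results = []
    # string= selects text nodes (NavigableString objects), not tags.
    for node in soup.find_all(string=abv_pattern):
        # Walk backwards from the text node to the closest preceding <h6>.
        name = node.find_previous("h6").get_text(strip=True)
        abv = float(abv_pattern.search(node).group(1))
        results.append((name, abv))
    return results


print(extract_abv(soup))
```

Converting the captured value to `float` is a design choice made here so the results can be sorted or filtered numerically; keep the string if only display is needed.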

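Here is how to use bs4 to get the page text and then use a regular expression to extract every matching ABV string:

from bs4 import BeautifulSoup
import re

webpage = "YOUR_WEBPAGE_STRING"  # the HTML source of the page

soup = BeautifulSoup(webpage, features="html.parser")
txt = soup.text  # all text on the page, with the tags stripped

# Raw string so that \d reaches the regex engine unchanged.
x = re.findall(r" \d+% ABV", txt)

print(x)

Running this against the given link produces output like the following:

[' 5% ABV', ' 6% ABV', ' 12% ABV', ' 8% ABV', ' 12% ABV', ' 7% ABV', ' 7% ABV', ' 11% ABV', ' 8% ABV', ' 12% ABV', ' 8% ABV', ' 7% ABV', ' 10% ABV', ' 8% ABV', ' 6% ABV', ' 5% ABV']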
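
The pattern in this answer matches only whole-number percentages (`\d+`), which is why decimal values such as 6.8% do not appear in the list above. One way to capture them as well is to allow an optional fractional part. This is a minimal sketch: the sample string stands in for `soup.text`, and the names `ABV_RE` and `find_abv` are illustrative assumptions, not part of the original answer:

```python
import re

# A whole or decimal number followed by "% ABV"; the group captures the number.
ABV_RE = re.compile(r"(\d+(?:\.\d+)?)% ABV")


def find_abv(text):
    """Return every ABV value found in text as a float."""
    return [float(value) for value in ABV_RE.findall(text)]


# Stand-in for soup.text; the real page text will look different.
sample = "Stout | 12% ABV ... IPA | 6.8% ABV ... Tripel | 7.25% ABV"
print(find_abv(sample))
```

Note that this approach returns only the numbers. Pairing each value with its beer name requires walking the parsed tree instead of the flattened text.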
