Python BeautifulSoup: scraping div, span and p tags, and how to get an exact match on a div name

Published 2024-06-17 16:23:12

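I have two divs that I am trying to scrape, both with the same class name (there are other divs on the page whose names partially match, which I don't want). From the first div, I only need the text inside each span element. From the second, I need the text in the span element for the first <p>, and then the text inside the <p> tags for rows 2 and 3.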

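I'm not even sure why I need the slice at the end of divs (I think it's because the div class col returns more than just the 2 relevant divs, but adding [:1] to the end of divs seems to help).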

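My questions are: how do I get an exact match on the div name, how do I scrape the text inside the p tags, and how do I combine the results? I can get the text inside the span tags as shown below, but as I said above, I also need the text inside the p tags, and I need to combine the two.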

The data comes from the player details section at this URL: https://www.skysports.com/football/player/141016/alisson-ramses-becker

The HTML looks like this:

    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>

The relevant part of my program:

    [code snippet missing from the source]

Output:

    [<p class="text-h4 title">Player Details</p>, <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>, <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>, <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>, <p>Club: <span itemprop="affiliation">Liverpool</span></p>, <p>Squad: 13</p>, <p>Position: Goal Keeper</p>]                               

I can also get the text from the spans like this:

    divs = player_soup.find_all('div', {'class': 'col'})
    for div in divs[:1]:
        spans = div.find_all('span')
        for span in spans:
            print(span.text, ",", end=' ')

Output:

    Alisson Ramses Becker , 02/10/1992 ,  Brazil , Liverpool ,

2 Answers

Your main problem is how to extract the text from a <p> that does not contain a <span>.

A NavigableString corresponds to a bit of text within a tag. So if a child of the tag is an instance of NavigableString, you can extract it as text:

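    from bs4 import BeautifulSoup, NavigableString

    html = "your example"  # placeholder: the HTML from the question

    soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
    # Iterating over a tag yields its direct children
    for e in soup.find("p"):
        print(e, type(e))
    # Name:  <class 'bs4.element.NavigableString'>
    # <strong><span itemprop="name">Alisson Ramses Becker</span></strong> <class 'bs4.element.Tag'>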

The real code:

    [code snippet missing from the source]

is equivalent to

    [element for result in resultset for element in result if isinstance(element, NavigableString)]
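
The snippet the answer labels as the real code did not survive in this copy, so the following is only a sketch of what the comprehension above does when written as explicit loops. It assumes a trimmed copy of the question's HTML and uses Python's built-in "html.parser" instead of "lxml" so that it runs without extra packages.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the question's HTML (assumption: two <p> tags are enough to show the idea)
html = """
<div class="col">
    <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
</div>
"""

# "html.parser" ships with Python; the answer itself uses "lxml"
soup = BeautifulSoup(html, "html.parser")
resultset = soup.find_all("p")

fr = []
for result in resultset:      # each <p> tag
    for element in result:    # iterating a tag yields its direct children
        if isinstance(element, NavigableString):
            fr.append(element)  # keep text nodes, skip nested tags such as <span>

print(fr)
```

Only the text that sits directly inside each <p> is collected; text inside a nested <span> is skipped because the <span> itself is a Tag, not a NavigableString.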

My full test code:

    from bs4 import BeautifulSoup, NavigableString

    html = """
    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p> <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p> <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
        </div>
    </div>
    """
    soup = BeautifulSoup(html, "lxml")  # requires the lxml package
    resultset = soup.find_all("p")
    # Text nodes that are direct children of each <p> (the label, or the whole line when there is no <span>)
    fr = [element for result in resultset for element in result if isinstance(element, NavigableString)]
    # Text of every <span> that has an itemprop attribute
    spanset = [e.text for e in soup.find_all("span", {"itemprop": True})]
    # Pair labels with span values by position; this relies on the <p> tags with spans coming first
    setA = ["".join(z) for z in zip(fr, spanset)]
    # The remaining <p> tags have no <span>, so their text is already complete
    final = setA + fr[len(spanset):]
    print(final)

Output:

    ['Name: Alisson Ramses Becker', 'Date of birth:02/10/1992', 'Place of birth: Brazil', 'Club: Liverpool', 'Squad: 13', 'Position: Goal Keeper']
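
The zip in the test code pairs labels and span values by position. If label/value pairs are more convenient than joined strings, one possible variation is to split each paragraph's full text at the first colon. This is a sketch rather than part of the original answer: the `details` name is made up for illustration, the HTML is a trimmed copy of the question's, and "html.parser" stands in for "lxml".

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML (assumption: a subset of the rows)
html = """
<div class="col">
    <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
    <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
    <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}  # hypothetical name: maps each label to its value
for p in soup.find_all("p"):
    # get_text() concatenates all text inside the <p>, nested tags included
    label, sep, value = p.get_text().partition(":")
    if sep:  # skip paragraphs that do not look like "label: value"
        details[label.strip()] = value.strip()

print(details)
```

Because each <p> is handled on its own, this does not depend on the paragraphs with spans appearing before the ones without.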

Assuming you have permission to scrape this site, and there is no API or JSON response available, a slow way to do it is:

    from bs4 import BeautifulSoup as bs

    html = '''
    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p> <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p> <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
        </div>
    </div>
    '''

    soup = bs(html, 'html5lib')  # requires the html5lib package

    # One list of <p> tags per div with class "col"
    data = [d.find_all('p') for d in soup.find_all('div', {'class': 'col'})]

    # Flatten, keeping the full text of each <p> (nested tags included)
    value = []
    for i in data:
        for j in i:
            value.append(j.text)

    print(value)
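
Neither snippet addresses the exact-match part of the question. `find_all('div', {'class': 'col'})` matches any div whose class list contains `col`, so a div such as `class="col -wide"` (a hypothetical example, not taken from the page) is returned as well. One way to keep only divs whose class attribute is exactly `col` is to pass a function that compares the whole class list. A minimal sketch, using "html.parser" rather than "html5lib":

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first div only partially matches, the second is the one we want
html = """
<div class="col -wide"><p>Other block</p></div>
<div class="col"><p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches every div whose class list contains "col"
loose = soup.find_all("div", class_="col")

# Comparing the whole class list keeps only divs whose class is exactly "col"
exact = soup.find_all(lambda tag: tag.name == "div" and tag.get("class") == ["col"])

# Full text of each <p> inside the exactly matched divs
values = [p.get_text() for div in exact for p in div.find_all("p")]
print(len(loose), len(exact), values)
```

With an exact match in place, slicing the result list to drop unwanted divs may no longer be needed, though that depends on the markup of the live page.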
