How do I get the text of an h1 tag in HTML without the link inside it?

User

Reply #1 · edited 2024-09-28 22:19:57

You can use a combination of regex and BeautifulSoup:

import re
from bs4 import BeautifulSoup

# html is assumed to hold the page markup as a string
soup = BeautifulSoup(html, 'lxml')
for link in soup.find_all('a', string=re.compile(r'^text link')):
    print(link)

This finds all links whose text starts with "text link".
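One caveat worth checking, sketched here against sample markup that is assumed to match the question's h1 (the question's exact HTML is not shown in this reply): if the link text is preceded by a newline and indentation, a pattern anchored with ^ will not match unless it allows leading whitespace. The parser is switched to the built-in 'html.parser' only so the sketch has no extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup: the link text starts with a newline and indentation.
html = """<h1 class="titleClass" itemprop="name">
    Text title here
    <a class="titleLink" href="somelink-here.html">
        text link here
    </a>
</h1>"""

soup = BeautifulSoup(html, "html.parser")

# Anchored at the very start of the string: the leading whitespace blocks the match.
strict = soup.find_all("a", string=re.compile(r"^text link"))

# Allowing optional leading whitespace lets the same anchor match.
lenient = soup.find_all("a", string=re.compile(r"^\s*text link"))

print(len(strict), len(lenient))
```

If the markup has no whitespace before the link text, both patterns behave the same.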

User

Reply #2 · edited 2024-09-28 22:19:57

Navigate to the <h1> and take the first string from its .stripped_strings generator:

>>> from bs4 import BeautifulSoup
>>> next(BeautifulSoup(html, 'html.parser').select_one('h1.titleClass').stripped_strings)
'Text title here'
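To make it clearer why taking the first item works, here is a minimal self-contained sketch. The sample markup is an assumption (the reply above does not define html); it lists everything .stripped_strings yields for the h1 before picking the first entry.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, with the title text placed before the link.
html = """<h1 class="titleClass" itemprop="name">
    Text title here
    <a class="titleLink" href="somelink-here.html">
        text link here
    </a>
</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.titleClass")

# stripped_strings walks every text node under the tag, strips each one,
# and skips nodes that are whitespace only.
all_strings = list(h1.stripped_strings)
title = all_strings[0]

print(all_strings)
print(title)
```

This relies on the title text coming before the link in the markup. If the link came first, the first string would be the link text instead.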

User

Reply #3 · edited 2024-09-28 22:19:57

You can grab the whole h1 tag and then extract the link from it, like this:

from bs4 import BeautifulSoup

html = """<h1 class="titleClass" itemprop="name">
    Text title here
    <a class="titleLink" href="somelink-here.html">
        text link here
    </a>
</h1>"""

soup = BeautifulSoup(html, 'html.parser')

p = soup.find('h1', attrs={'class': 'titleClass'})
p.a.extract()  # removes the <a> tag from the tree
print(p.text.strip())

This prints:

Text title here
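Note that extract() modifies the parsed tree. If the link is still needed afterwards, one possible alternative (a sketch, not something proposed in the replies above) is to collect only the text nodes that are direct children of the h1, which leaves the tree untouched:

```python
from bs4 import BeautifulSoup

html = """<h1 class="titleClass" itemprop="name">
    Text title here
    <a class="titleLink" href="somelink-here.html">
        text link here
    </a>
</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", class_="titleClass")

# recursive=False limits the search to direct children of the h1,
# so text inside the nested <a> is not included.
direct_text = h1.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join what is left.
title = " ".join(s.strip() for s in direct_text if s.strip())

print(title)
```

Unlike the .stripped_strings approach, this does not depend on the title appearing before the link, and the <a> tag stays available on h1.a.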
