Find a tag between two i tags with BeautifulSoup

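User

Reply 1 · edited 2024-05-02 23:20:39

You can combine BeautifulSoup with a regex. The regex grabs everything between the two delimiting tags, and BeautifulSoup then extracts the anchor tags from each excerpt.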

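from bs4 import BeautifulSoup
import re

# `html` is the markup from the question. The pattern matches its closing tags
# as written there, with a backslash (<\i>) instead of a forward slash.
excerpts = re.findall(r'<i>Hello<\\i>(.*?)<i>Bye<\\i>', html, re.DOTALL)

for e in excerpts:
    # Name the parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(e, 'html.parser')
    for link in soup.find_all('a'):
        print(link)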

Output: the loop prints each anchor tag found between a Hello marker and the Bye marker that follows it.
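For readers who want to run this approach end to end, here is a minimal sketch. It assumes well-formed closing tags (forward slashes) rather than the backslashes in the question, and the sample markup and the helper name links_between are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup with well-formed closing tags (an assumption;
# the question's input used backslashes instead).
html = """
<i>Hello</i>
<a href="www.google.com"> Google </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""

def links_between(markup):
    """Return the text of every <a> between <i>Hello</i> and <i>Bye</i>."""
    found = []
    # Non-greedy match so each Hello pairs with the nearest following Bye.
    for chunk in re.findall(r"<i>Hello</i>(.*?)<i>Bye</i>", markup, re.DOTALL):
        soup = BeautifulSoup(chunk, "html.parser")
        found.extend(a.get_text(strip=True) for a in soup.find_all("a"))
    return found

link_texts = links_between(html)
print(link_texts)
```

The Google2 link is skipped because it sits after a Bye marker and before the next Hello.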

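User

Reply 2 · edited 2024-05-02 23:20:39

I modified your HTML slightly. (Note that the backslashes should be forward slashes.)

To do this, first find the 'Hello' strings. In the for loop, call each of these strings s. What you want is then s.findParent().findNextSibling().

I display s, s.findParent() and s.findParent().findNextSibling() to show how I build what you need from these strings.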

>>> import bs4
>>> HTML = '''\
... <i>Hello</i>
... <a href="www.google.com"> Google </a>
... <i>Bye</i>
... <a href="www.google.com"> Google2 </a>
... <i>Hello</i>
... <a href="www.google.com"> Google3 </a>
... <i>Bye</i>
... '''
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> for s in soup.find_all(string='Hello'):
...     s, s.findParent(), s.findParent().findNextSibling()
...     
('Hello', <i>Hello</i>, <a href="www.google.com"> Google </a>)
('Hello', <i>Hello</i>, <a href="www.google.com"> Google3 </a>)
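Note that findNextSibling() returns only the first tag after each Hello marker. If several links may sit between Hello and Bye, one possible extension is to walk the following siblings until the Bye marker is reached. This is a sketch only: the helper name anchors_between, the extra Example link, and the use of html.parser are assumptions, not part of the reply above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first Hello/Bye pair contains two links.
html = """
<i>Hello</i>
<a href="www.google.com"> Google </a>
<a href="www.example.com"> Example </a>
<i>Bye</i>
<a href="www.google.com"> Google2 </a>
<i>Hello</i>
<a href="www.google.com"> Google3 </a>
<i>Bye</i>
"""

def anchors_between(markup, start="Hello", end="Bye"):
    """Return one list of link texts per start/end pair of <i> markers."""
    soup = BeautifulSoup(markup, "html.parser")
    groups = []
    for s in soup.find_all(string=start):
        marker = s.find_parent("i")
        if marker is None:
            continue  # the string is not inside an <i> tag
        group = []
        # find_next_siblings() yields only tags, in document order.
        for sib in marker.find_next_siblings():
            if sib.name == "i" and sib.get_text(strip=True) == end:
                break  # reached the closing marker
            if sib.name == "a":
                group.append(sib.get_text(strip=True))
        groups.append(group)
    return groups

groups = anchors_between(html)
print(groups)
```

This uses the snake_case method names (find_parent, find_next_siblings), which are the current spellings of the camelCase methods shown in the session above.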

User

Reply 3 · edited 2024-05-02 23:20:39

Perhaps you could use the re module. For reference, see the Regular Expression HOWTO for Python 2.

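import re

# Raw string so the backslashes from the question stay literal; in a normal
# string, "\a" would be read as the BEL control character.
str_tags = r"""
<i>Hello<\i>
<a href="www.google.com"> Google <\a>
<i>Bye<\i>
<a href="www.google.com"> Google2 <\a>
<i>Hello<\i>
<a href="www.google.com"> Google3 <\a>
<i>Bye<\i>
"""

# "\\a" matches a literal backslash followed by "a".
str_re = re.compile(r".*Hello.*\s<a[^>]*>([\w\s]+)<\\a>\s<i>Bye")
content_lst = str_re.findall(str_tags)
if content_lst:
    print(content_lst)
else:
    print("Not found")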

Output

[' Google ', ' Google3 ']

Note that this approach depends heavily on what the HTML looks like. For an explanation of the code above, see the first link.
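To make that caveat concrete, here is a small sketch of how easily the pattern stops matching. It assumes forward-slash closing tags, and both sample strings are invented for illustration. The only difference between them is a punctuation mark in the link text, which the [\w\s]+ group does not accept.

```python
import re

# Same pattern as above, adapted to well-formed closing tags (an assumption).
pattern = re.compile(r".*Hello.*\s<a[^>]*>([\w\s]+)</a>\s<i>Bye")

# Hypothetical inputs: identical except for the "!" in the link text.
original = '<i>Hello</i>\n<a href="www.google.com"> Google </a>\n<i>Bye</i>\n'
variant = '<i>Hello</i>\n<a href="www.google.com"> Google! </a>\n<i>Bye</i>\n'

matches_original = pattern.findall(original)
matches_variant = pattern.findall(variant)

print(matches_original)
print(matches_variant)  # empty: "!" is neither a word character nor whitespace
```

A parser-based approach does not depend on the characters in the link text or on where line breaks fall between tags.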
