Extracting text that is not between tags with BeautifulSoup

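User

Reply 1 · edited 2024-06-25 23:55:30

To be honest, I'm not sure exactly what you are looking for, but judging by the comments and the other answers, the following should do what you want: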

from bs4 import BeautifulSoup


html = '''<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>'''


soup = BeautifulSoup(html, 'lxml')
print(list(soup.select_one('.filmo-row').stripped_strings))

Output:

['2021', 'Welcome Back Future', '(Short)', 'Leo']
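
As a rough sketch of how that list might be used afterwards: the first two strings come from the <span> and <b> tags, and whatever follows is the loose text that sits outside any tag. The `record` dictionary and its keys are only illustrative names, and the built-in 'html.parser' is used here instead of 'lxml' purely so the snippet has no extra dependency. It also assumes every row starts with a year and a title, which may not hold for every row on the real page.

```python
from bs4 import BeautifulSoup

html = '''<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>'''

soup = BeautifulSoup(html, 'html.parser')
strings = list(soup.select_one('.filmo-row').stripped_strings)

# Assumption: the first string is the year and the second is the title.
# Everything after that is the untagged text (category and/or role name).
record = {
    'year': strings[0],
    'title': strings[1],
    'extra': strings[2:],
}
print(record)
```

Keeping the trailing strings as a list avoids guessing whether a given row has a category such as (Short) at all.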

User

Reply 2 · edited 2024-06-25 23:55:30

I don't know bs4 very well, but I somehow came across next_sibling while searching, and it solved my problem.

So this is what I do:

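# Fragment from inside a function that parses one filmography row.
# The next_sibling of the first <b> is the text node right after it,
# e.g. '\n     (Short)\n    ' for the row in the question.
category = movie_soup.find_all('b')[0].next_sibling
if any(kind in category for kind in ('TV', 'Short', 'Series', 'Video', 'Documentary')):
    return None, None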

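If I find a film I don't need because it falls into one of the categories I want to skip, I return None, None. I know this isn't the best coding style, but it works well for me.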
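
For anyone who wants to try this on its own, here is a self-contained sketch of the same idea. The function name `parse_row`, the `SKIPPED` tuple, and the second sample row are made up for illustration; only the first row comes from the question. It also guards against the case where the node after <b> is not a text node, which the fragment above does not handle.

```python
from bs4 import BeautifulSoup, NavigableString

# Category keywords to skip, taken from the fragment above.
SKIPPED = ('TV', 'Short', 'Series', 'Video', 'Documentary')


def parse_row(row):
    """Return (title, category_text), or (None, None) for skipped rows."""
    bold = row.find('b')
    if bold is None:
        return None, None
    sibling = bold.next_sibling
    # next_sibling may be None or another tag, so only read it as text
    # when it really is a string node.
    category = sibling.strip() if isinstance(sibling, NavigableString) else ''
    if any(kind in category for kind in SKIPPED):
        return None, None
    return bold.get_text(strip=True), category


short_row = BeautifulSoup('''<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>''', 'html.parser').div

# Hypothetical row with no category text after the title.
plain_row = BeautifulSoup('''<div class="filmo-row odd">
    <span class="year_column">2019</span>
    <b><a href="/title/tt0000000/">Example Film</a></b>
    <br/>
     Sam
</div>''', 'html.parser').div

print(parse_row(short_row))
print(parse_row(plain_row))
```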

User

Reply 3 · edited 2024-06-25 23:55:30

You can use the following options:

from bs4 import BeautifulSoup as bs

HTML="""<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>
"""

soup = bs(HTML, "lxml")

# string= is the current name for the older text= argument
print(soup.find("div").find_all(string=True, recursive=False))
# ['\n', '\n', '\n     (Short)\n    ', '\n     Leo\n']

# With html5lib as the parser, the result is slightly different:
soup = bs(HTML, "html5lib")
print(soup.find("div").find_all(string=True, recursive=False))
# ['\n    ', '\n    ', '\n     (Short)\n    ', '\n     Leo\n']

# If you want all of the text inside the div, try this:
print(soup.find("div").find_all(string=True, recursive=True))
# ['\n', '2021', '\n', 'Welcome Back Future', '\n     (Short)\n    ', '\n     Leo\n']
# Or simply use
print(soup.find("div").text)
"""
2021
Welcome Back Future
     (Short)

     Leo

"""

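I think you can clean this up yourself from here. I assume that getting a list of every film they appeared in as an actor means you will also need the role name, Leo.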
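
One possible way to do that clean-up, as a minimal sketch: strip each direct text node of the div and drop the ones that are only whitespace. The variable name `direct_text` is just illustrative, `string=True` is the newer spelling of the `text=True` argument, and 'html.parser' is used so the snippet runs without lxml or html5lib installed.

```python
from bs4 import BeautifulSoup

HTML = """<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div")

# recursive=False keeps only the text nodes that are direct children of the
# div, i.e. the text that is not wrapped in <span>, <b> or <a>.
direct_text = [
    s.strip()
    for s in div.find_all(string=True, recursive=False)
    if s.strip()  # skip whitespace-only nodes
]
print(direct_text)  # ['(Short)', 'Leo']
```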
