How to extract text based on a condition in Python

<a href="/title/tt0110912/" title="Quentin Tarantino"> Pulp Fiction </a> <a href="/title/tt0137523/" title="David Fincher"> Fight Club </a> <a href="blablabla" title="Yet to Release"> Yet to Release </a> <a href="something" title="Movies"> Coming soon </a>

3 Answers

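User

Answer 1 · Edited 2024-09-28 19:22:56

You can use a regular expression to search the contents of an attribute (in this case, href) and keep only the tags that match.

For more details, see this answer: https://stackoverflow.com/a/47091570/1426630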
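
As a rough illustration of the regex idea, here is a minimal sketch using only the standard `re` module. The variable name `html_text` and the pattern itself are my own assumptions, not taken from the linked answer, and regexes on HTML are brittle: this only works for simple, well-formed markup like the snippet in the question (double-quoted attributes, no nested tags).

```python
import re

# HTML snippet from the question (the name html_text is an assumption for this sketch)
html_text = (
    '<a href="/title/tt0110912/" title="Quentin Tarantino"> Pulp Fiction </a> '
    '<a href="/title/tt0137523/" title="David Fincher"> Fight Club </a> '
    '<a href="blablabla" title="Yet to Release"> Yet to Release </a> '
    '<a href="something" title="Movies"> Coming soon </a>'
)

# Match <a> tags whose href value starts with "/title/" and capture the link text.
# [^>]* never crosses a ">", so the match cannot spill into the next tag.
pattern = re.compile(r'<a\s[^>]*href="/title/[^"]*"[^>]*>(.*?)</a>', re.DOTALL)

# Strip the padding whitespace around each captured title
titles = [text.strip() for text in pattern.findall(html_text)]
print(titles)
```

For anything beyond a quick one-off, an actual HTML parser is usually the safer choice.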

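User

Answer 2 · Edited 2024-09-28 19:22:56

1.) To get all <a> tags whose href starts with "/title/", you can use the CSS selector a[href^="/title/"]

2.) To get the text inside each tag with the surrounding whitespace stripped, you can use .get_text() with the parameter strip=True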

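from bs4 import BeautifulSoup

# html_text is assumed to hold the HTML snippet from the question
soup = BeautifulSoup(html_text, 'html.parser')

# keep only links whose href starts with "/title/", then strip whitespace from their text
out = [a.get_text(strip=True) for a in soup.select('a[href^="/title/"]')]
print(out)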

Prints:

['Pulp Fiction', 'Fight Club']
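
If installing BeautifulSoup is not an option, the same condition (href starts with "/title/") can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not what the answer above uses; the class name `TitleLinkParser` and the `prefix` parameter are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with a given prefix."""

    def __init__(self, prefix="/title/"):
        super().__init__()
        self.prefix = prefix
        self.titles = []
        self._buffer = None  # list of text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._buffer = []  # start collecting text for this link

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # join the collected chunks and strip the padding whitespace
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


html_text = (
    '<a href="/title/tt0110912/" title="Quentin Tarantino"> Pulp Fiction </a> '
    '<a href="/title/tt0137523/" title="David Fincher"> Fight Club </a> '
    '<a href="blablabla" title="Yet to Release"> Yet to Release </a> '
    '<a href="something" title="Movies"> Coming soon </a>'
)

parser = TitleLinkParser()
parser.feed(html_text)
parser.close()
print(parser.titles)
```

It is more code than the CSS selector version, so it mainly makes sense when a dependency-free script is needed.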

User

Answer 3 · Edited 2024-09-28 19:22:56

I guess you want something like this:

from bs4 import BeautifulSoup

html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>

<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>

<a href="blablabla" title="Yet to Release">
Yet to Release
</a>

<a href="something" title="Movies">
Coming soon
</a>
'''

soup = BeautifulSoup(html, 'html.parser')

titles = []

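# href=True is a find_all() filter and is not needed here:
# the [href*="/title/"] selector already requires an href that contains "/title/"
for a in soup.select('a[href*="/title/"]'):
    if a.text:
        # replace newlines with spaces; the padding spaces stay in the result
        titles.append(a.text.replace('\n', " "))
print(titles)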

Output:

[' Pulp Fiction ', ' Fight Club ']
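
Note the leading and trailing spaces in that output: replacing '\n' with a space keeps the padding. A small sketch of one way to clean that up in plain Python, assuming the raw strings look like the link text in the snippet above (the list `raw_titles` is made up for illustration):

```python
# Raw link text as it appears in the multi-line HTML above (assumed for this sketch)
raw_titles = ['\nPulp Fiction\n', '\nFight Club\n']

# What the answer's replace() call produces: newlines become padding spaces
replaced = [t.replace('\n', ' ') for t in raw_titles]

# split() with no arguments splits on any run of whitespace and drops empty
# pieces, so joining with a single space both trims and collapses whitespace
normalized = [' '.join(t.split()) for t in raw_titles]

print(replaced)
print(normalized)
```

With BeautifulSoup, `a.get_text(strip=True)` gives the same trimmed result for single-text links like these.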
