Extracting the text of specific TD table elements with BeautifulSoup?

<html>
<body>
<table class="mainTable">
  <thead>
    <tr> <th>IP</th> <th>Country</th> </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="hello.html">127.0.0.1<a></td>
      <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">192.168.0.1<a></td>
      <td><img src="uk.gif" /><a href="us.com">us</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">255.255.255.0<a></td>
      <td><img src="uk.gif" /><a href="br.com">br</a></td>
    </tr>
  </tbody>
</table>

3 Answers

User

#1 · Edited 2024-05-21 10:33:01

You can use a small regular expression to pick out the IP addresses. BeautifulSoup combined with regular expressions works well for scraping.

import re

# `table` is assumed to be the parsed <table> element, e.g. soup.find("table")
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for link in table.find_all("a"):
    if ip_pat.match(link.text):
        print(link.text)
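
The pattern above only checks the shape of the string, so a value such as 999.1.1.1 would also match. If that matters, one option is to let the standard-library ipaddress module do the range check. The sketch below is only an illustration: the helper name is_ipv4 and the sample strings are made up for this example and are not part of the original answer.

```python
import ipaddress
import re

# Same shape check as in the answer: four dot-separated groups of 1-3 digits.
IP_SHAPE = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")


def is_ipv4(text):
    """Return True if text looks like a dotted quad and every octet is in 0-255."""
    if not IP_SHAPE.match(text):
        return False
    try:
        # IPv4Address raises a ValueError subclass for out-of-range octets.
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


# Hypothetical link texts, similar to what table.find_all("a") would yield.
candidates = ["127.0.0.1", "uk", "999.1.1.1", "255.255.255.0", ""]
valid = [c for c in candidates if is_ipv4(c)]
print(valid)
```

In the loop from the answer, ip_pat.match(link.text) could be swapped for is_ipv4(link.text) if the stricter check is wanted.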

User

#2 · Edited 2024-05-21 10:33:01

This gives you the right list:

>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

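Then just apply .text to each element of this list.

The list above contains several empty <a></a> tags because the <a> tags in the HTML are not closed properly. To get rid of them, you can use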

pred = lambda tag: tag.parent.find('img') is None and tag.text

Finally:

>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
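
As an alternative to the lambda filter, the same links can probably be reached with a CSS selector through select(). This is a sketch, assuming the html.parser backend (other parsers may repair the unclosed <a> tags differently) and a shortened two-row copy of the table from the question; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table, keeping the unclosed <a> tags.
SAMPLE = """
<table class="mainTable">
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td><td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td><td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# td:first-child limits the match to the first cell of each row;
# a[href] skips the empty <a> elements created by the unclosed tags.
links = soup.select("tbody td:first-child > a[href]")
ips = [link.get_text(strip=True) for link in links]
print(ips)
```

The selector relies on the position of the cell rather than on the absence of an <img>, so it keeps working if an image is later added to the first column.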

User

#3 · Edited 2024-05-21 10:33:01

Look only at the first <td> of each row in tbody:

import bs4

# html should contain the page content
[row.find('td').get_text() for row in bs4.BeautifulSoup(html, 'html.parser').find('tbody').find_all('tr')]

Or, more readably:

rows = bs4.BeautifulSoup(html, 'html.parser').find('tbody').find_all('tr')
iplist = [row.find('td').get_text() for row in rows]
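
For completeness, here is a self-contained version of the row-by-row approach that can be run as is. It assumes the html.parser backend and the table from the question; the function name first_column_text is hypothetical, and skipping a missing tbody or rows without a <td> is an added safeguard rather than something the answer requires.

```python
from bs4 import BeautifulSoup

# The table from the question, including its unclosed <a> tags.
HTML = """
<table class="mainTable">
<thead><tr><th>IP</th><th>Country</th></tr></thead>
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td><td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td><td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
<tr><td><a href="hello.html">255.255.255.0<a></td><td><img src="uk.gif" /><a href="br.com">br</a></td></tr>
</tbody>
</table>
"""


def first_column_text(html):
    """Return the stripped text of the first <td> in every <tbody> row."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("tbody")
    if body is None:
        # No table body at all: nothing to extract.
        return []
    cells = (row.find("td") for row in body.find_all("tr"))
    # Rows without any <td> are skipped instead of raising AttributeError.
    return [cell.get_text(strip=True) for cell in cells if cell is not None]


iplist = first_column_text(HTML)
print(iplist)
```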
