Python regex for parsing an HTML document

2 answers

User

Answer 1 · edited 2024-10-02 08:29:40

This can be done easily with BeautifulSoup:

from bs4 import BeautifulSoup as soup

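cells = ['<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>', '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>']
# Name the parser explicitly so the result does not depend on which parsers are installed.
links = [soup(cell, 'html.parser').find('td').find('a') for cell in cells]
# Keep only links that exist and carry a title attribute.
titles = [link['title'].strip() for link in links if link is not None and link.has_attr('title')]
print(titles)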

If the input is a single string, you can use:

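html = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''
# Parse once, then take the first link in every table cell.
links = [cell.find('a') for cell in soup(html, 'html.parser').find_all('td')]
# Cells without a link yield None, so skip those before reading the title.
titles = [link['title'].strip() for link in links if link is not None and link.has_attr('title')]
print(titles)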
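If bs4 is not installed, roughly the same extraction can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not part of the original answer; the class name TitleCollector and the sample string are made up for illustration, and the sample uses normally spaced attributes.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> found inside a <td>."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0   # how many <td> elements are currently open
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            # attrs is a list of (name, value) pairs; value may be None.
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


# Illustrative input modelled on the cells in the answer above.
document = (
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc." '
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td> '
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'
)

collector = TitleCollector()
collector.feed(document)
collector.close()
print(collector.titles)
```

Unlike BeautifulSoup, this builds no tree; it only reacts to tags as they stream past, which is enough for pulling out one attribute.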

If you still want to use a regex, then:

<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>

Note: the title is captured in the first group.

Regex Demo
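A minimal sketch of how the pattern might be applied from Python with the re module. The re.DOTALL flag is an assumption added here so that a cell spanning several lines would still match; the name TITLE_IN_CELL is only for illustration.

```python
import re

# Pattern from the answer; group 1 holds the title text.
# re.DOTALL (an assumption) lets ".*?" cross line breaks inside a cell.
TITLE_IN_CELL = re.compile(r'<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>', re.DOTALL)

html = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''

# With a single capturing group, findall returns the group-1 strings.
titles = [title.strip() for title in TITLE_IN_CELL.findall(html)]
print(titles)
```

Because every quantifier is lazy and each match ends at the nearest closing td tag, each cell produces at most one title.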

User

Answer 2 · edited 2024-10-02 08:29:40

This captures the content of the a tag's title attribute in a group. It checks that the cell is the first table cell after the rank.

regex = /th>\n<td.*?><a .* ?title="(.*?)".*>/

It is known to work for now, but it is a fairly fragile approach. See the Online Regex Tester for details on the regex.
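To make the fragility concrete, here is a rough sketch that ports the pattern to a Python raw string and runs it on made-up rows. The table layout (a th rank cell followed by the company cell on the next line) is an assumption about the page, and the names FIRST_CELL_TITLE, rows and two_links are hypothetical.

```python
import re

# Same pattern as the /.../ literal above, written as a Python raw string.
FIRST_CELL_TITLE = re.compile(r'th>\n<td.*?><a .* ?title="(.*?)".*>')

# Hypothetical rows: a <th> rank cell, then the company cell on the next line.
rows = '''<tr>
<th>1</th>
<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc." class="mw-redirect">Wal-Mart Stores, Inc.</a></td>
</tr>
<tr>
<th>2</th>
<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>
</tr>'''

titles = FIRST_CELL_TITLE.findall(rows)
print(titles)

# Fragility: with two links on one line, the greedy ".*" runs to the
# last title attribute, so the first link's title is skipped.
two_links = '<th>3</th>\n<td><a href="/a" title="First">First</a> <a href="/b" title="Second">Second</a></td>'
skipped = FIRST_CELL_TITLE.findall(two_links)
print(skipped)
```

The pattern also depends on the exact line break between the th and td tags, so any reformatting of the page's HTML would stop it from matching.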
