What is wrong with this regular expression?

3 answers

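User

Answer 1 · edited 2024-09-27 21:34:02

When the pattern contains more than one capturing group, re.findall returns a list of tuples rather than a list of strings. Try the following, which uses only a single group: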

>>> import re
>>> page = '''
...     <a href="http://asecuritysite.com">here</a>
...     <a href="https://www.sans.org/webcasts/archive/2013">there</a>
...     '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']

According to the re.findall documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
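
To make the quoted behavior concrete, here is a minimal sketch. It assumes the original pattern looked roughly like href="(http(s?)://[^"]+)"; the question's actual pattern is not shown on this page, so that pattern is a guess based on the (s?) group mentioned in the second answer. The variable names are illustrative only.

```python
import re

page = '''
    <a href="http://asecuritysite.com">here</a>
    <a href="https://www.sans.org/webcasts/archive/2013">there</a>
    '''

# Assumed shape of the original pattern: two capturing groups,
# the whole URL and the optional "s".
two_groups = re.findall(r'href="(http(s?)://[^"]+)"', page)
print(two_groups)
# [('http://asecuritysite.com', ''), ('https://www.sans.org/webcasts/archive/2013', 's')]

# Making the inner group non-capturing with (?:...) leaves one group,
# so findall goes back to returning plain strings.
one_group = re.findall(r'href="(http(?:s?)://[^"]+)"', page)
print(one_group)
# ['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
```

The non-capturing form is useful when the parentheses are needed for grouping but their contents are not wanted in the result.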

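User

Answer 2 · edited 2024-09-27 21:34:02

Try getting rid of the second group (the (s?) in the original pattern). Use a raw string so that the backslashes are not treated as invalid string escapes:

links = re.findall(r'href="(https?:\/\/[^"]+)"', page)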
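
If the second group has to stay for some other reason, one alternative is to iterate over match objects and pick the group you want. This is only a sketch: the two-group pattern below is an assumed reconstruction of the original, and the sample page is borrowed from the first answer.

```python
import re

page = '''
    <a href="http://asecuritysite.com">here</a>
    <a href="https://www.sans.org/webcasts/archive/2013">there</a>
    '''

# Assumed two-group pattern; group 1 is the full URL, group 2 the optional "s".
pattern = re.compile(r'href="(http(s?)://[^"]+)"')

# finditer yields match objects, so each group can be selected explicitly.
links = [match.group(1) for match in pattern.finditer(page)]
print(links)
# ['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
```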

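User

Answer 3 · edited 2024-09-27 21:34:02

What you are doing wrong is trying to parse HTML with a regex. That, sir, is a sin, you know.

See here for the horrors of Regex parsing HTML

An alternative is to parse the page with something like lxml and extract the links. The line below assumes html is the already parsed document, for example the result of lxml.html.fromstring(page):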

urls = html.xpath('//a/@href')
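
If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This swaps in a different parser from the one the answer names; the LinkCollector class is a hypothetical helper written for this example, and the sample page is the one from the first answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.urls.append(value)


page = '''
    <a href="http://asecuritysite.com">here</a>
    <a href="https://www.sans.org/webcasts/archive/2013">there</a>
    '''

collector = LinkCollector()
collector.feed(page)
urls = collector.urls
print(urls)
# ['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
```

Unlike the regex, this collects every href on an a tag, not only those starting with http or https, so filter the list afterwards if only absolute web links are wanted.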
