Python regular expressions and HTML

2 answers

User

Answer 1 · edited 2024-05-18 12:03:49

Use an HTML parser, such as BeautifulSoup. It lets you pass a compiled regular expression to match an attribute value:

soup.find_all('a', href=re.compile(r"after=t3_\w+"))

Working example:

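import re
from bs4 import BeautifulSoup
import requests

url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"
# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(response.content, "html.parser")

# Find every <a> tag whose href contains "after=t3_" followed by word characters.
print(soup.find_all('a', href=re.compile(r"after=t3_\w+")))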
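
The same lookup can be tried without a network request. This is a minimal sketch, assuming a made-up HTML snippet in place of the downloaded page; the sample markup and the names html, pattern, and hrefs are illustrative and not part of the original answer:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<a href="https://www.reddit.com/r/spacex/?count=25&amp;after=t3_319905">next</a>
<a href="https://www.reddit.com/r/spacex/about">about</a>
"""

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern used as an attribute filter is applied with search(),
# so it can match anywhere inside the href value.
pattern = re.compile(r"after=t3_\w+")
links = soup.find_all("a", href=pattern)

# Collect the href of each matching <a> tag.
hrefs = [a["href"] for a in links]
print(hrefs)
```

Only the first link should be returned, because the second href has no after=t3_ parameter.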

See also the obligatory link for regex + HTML questions:

RegEx match open tags except XHTML self-contained tags

User

Answer 2 · edited 2024-05-18 12:03:49

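? is a special character in regex: it makes the preceding token optional. You need to escape it as \? in order to match a literal ? character. You also need to escape the dots, except for the dot in .+?, which is meant to match any character.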

re.search(r'(<a href=")(https://www\.reddit\.com/r/spacex/\?count=25.+?)(")', subreddit).group(2)
                                                          ^
                                                          |

The extra capturing groups are not needed here. A single capturing group is enough.

re.search(r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit).group(1)
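
To see why the escape matters, the two patterns can be compared side by side. This is a small sketch, assuming a hypothetical one-line value for subreddit (the real page source is not shown in the question):

```python
import re

# Hypothetical stand-in for the page source held in `subreddit`.
subreddit = '<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905">next</a>'

# Unescaped: "/?" means "an optional slash", so the literal "?" in the URL
# is never matched and the search fails.
unescaped = re.search(
    r'<a href="(https://www\.reddit\.com/r/spacex/?count=25.+?)"', subreddit
)

# Escaped: "\?" matches the literal question mark.
escaped = re.search(
    r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit
)

print(unescaped)         # no match object
print(escaped.group(1))  # the captured URL
```

In real code it is worth checking the result of re.search for None before calling .group(), since a failed match would otherwise raise AttributeError.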
