贪婪的正则表达式

2024-06-02 00:03:21 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在编写一个regex来获取""之间的数据。我遇到的唯一问题是最后一个"被捕获。Regex

  line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
  capture_regex = re.compile(r'(?<=HREF=").*?"',re.IGNORECASE)
  m = capture_regex.search(line)

m.group()打印https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html"。如何在不包含最后一个引号的地方编写正则表达式。你知道吗

回答了我的问题。我在正则表达式中添加了所谓的非贪婪。 capture_regex = re.compile(r'(?<=HREF=").*?(?=")',re.IGNORECASE)。通过在*之后添加?使它只在第一个"处停止。你知道吗


Tags: httpsorgrehtmllineregexsheetcapture
3条回答
capture_regex = re.compile(r'(?<=HREF=").*?(?=")',re.IGNORECASE)

working fiddle

编辑:调整正则表达式,因为它太贪婪了。感谢@newdeveloper指出这一点!你知道吗

这将起作用:

import re

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'

capture_regex = re.compile(r'(?<=HREF=")([^"]*)(?:")',re.IGNORECASE)
# capture_regex = re.compile(r'(?:HREF=")([^"]*)(?:")',re.IGNORECASE) this will work too
print(capture_regex.search(line).groups())
# print(capture_regex.findall(line))  # if your text contains more than one HREF

输出:

  ['https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html']

也许,bs4的find_all可以正常工作:

from bs4 import BeautifulSoup

line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
soup = BeautifulSoup(line, 'html.parser')

for l in soup.find_all('a', href=True):
    print(l['href'])

输出

https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html

如果不是,也许,一些类似于

(?i)href="\s*([^\s"]*?)\s*"

re.findall可能在这里工作:

import re

expression = r'(?i)href="\s*([^\s"]*?)\s*"'

string = """
<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
<DT><A HREF=" https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html " ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
"""

print(re.findall(expression, string))

输出

['https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html', 'https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html']

If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.


相关问题 更多 >