Matching HTML tags with regex in Python

Posted 2024-09-29 17:19:24


str="<p class=\"drug-subtitle\"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"

br=re.match("<p> class=\"drug-subtitle\"[^>]*>(.*?)</p>",str)

br returns nothing.

What is wrong with the regular expression I am using?


3 answers

I strongly recommend using a DOM parser library instead, for example lxml together with cssselect.

Example:

>>> from lxml.html import fromstring
>>> html = """<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"""
>>> doc = fromstring(html)
>>> "".join(filter(None, (e.text for e in doc.cssselect(".drug-subtitle")[0])))
'Generic Name:Brand Names:Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA'
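Note that the join above only reads each child element's .text, so text that follows a child tag (such as " albuterol inhalation (al BYOO ter all)") is left out of the result. If lxml is not available, roughly the same idea can be sketched with the standard library's html.parser. This is only a minimal sketch: the class name ParagraphText and its attributes are made up for illustration, and it assumes the target paragraphs are not nested.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: collect all text inside <p> tags with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.inside = False   # True while we are inside a matching <p>
        self.chunks = []      # text pieces found inside the paragraph

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and self.css_class in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
        'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')

parser = ParagraphText("drug-subtitle")
parser.feed(html)
parser.close()
text = "".join(parser.chunks)
print(text)
```

Unlike the .text join, this keeps the text that sits between the child tags as well.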

Here is the fixed regex. On the second line I point at the spot where yours went wrong. I use findall() as a convenient way to print every matched group.

print(re.findall('<p class="drug-subtitle"[^>]*>(.*?)</p>',input))
                    ^ you had a > character here
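For reference, the corrected pattern can be tried out as follows. This is a small sketch assuming Python 3; the variable names and the "shifted" variant of the input are made up for illustration. It also shows a detail worth keeping in mind: re.match() only matches at the very start of the string, while re.search() and re.findall() scan the whole string.

```python
import re

html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
        'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')
pattern = r'<p class="drug-subtitle"[^>]*>(.*?)</p>'

# The question's string starts with "<p", so re.match() works here.
at_start = re.match(pattern, html)
inner = at_start.group(1)
print(inner)

# Hypothetical variant: the same markup preceded by a newline.
shifted = "\n" + html
missed = re.match(pattern, shifted)       # None: match() is anchored at position 0
found = re.search(pattern, shifted)       # search() finds it anywhere in the string
all_groups = re.findall(pattern, shifted) # list with one captured group per match
print(missed, found is not None, len(all_groups))
```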

However, BeautifulSoup would be a simpler choice for this kind of task:

^{pr2}$

If you have the input:

'<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>' 

and you want to check that:

^{pr2}$

exists in the input, the regex to use is:

\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\> 

Explanation:

\< matches the character < literally
p matches the character p literally (case sensitive)
\s match any white space character [\r\n\t\f ]
class= matches the characters class= literally (case sensitive)
\" matches the character " literally
drug-subtitle matches the characters drug-subtitle literally (case sensitive)
\" matches the character " literally
[^>]* match a single character not present in the list below
    Quantifier: Between zero and unlimited times, as many times as possible,
               giving back as needed.
    > a single character in the list > literally (case sensitive)
> matches the character > literally
1st Capturing group (.*?)
    .*? matches any character (except newline)
        Quantifier: Between zero and unlimited times, as few times as possible,
                    expanding as needed.
\< matches the character < literally
\/ matches the character / literally
p matches the character p literally (case sensitive)
\> matches the character > literally
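The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The sketch below is an illustration under a few assumptions: the name PATTERN and the sample strings are made up, the optional backslashes are dropped, and re.DOTALL is an addition that is not part of the original answer (it lets "." also match newlines, which matters if a paragraph spans several lines).

```python
import re

PATTERN = re.compile(r"""
    <p\s                    # literal "<p" followed by one whitespace character
    class="drug-subtitle"   # the class attribute, matched literally
    [^>]*                   # anything that is not ">" (e.g. further attributes)
    >                       # end of the opening tag
    (.*?)                   # group 1: as few characters as possible (lazy)
    </p>                    # literal closing tag
""", re.VERBOSE | re.DOTALL)

# Hypothetical input with two paragraphs, to compare lazy and greedy matching.
two = '<p class="drug-subtitle">one</p><p class="drug-subtitle">two</p>'
lazy = PATTERN.findall(two)
greedy = re.findall(r'<p class="drug-subtitle"[^>]*>(.*)</p>', two)
print(lazy)    # each paragraph captured separately
print(greedy)  # one capture running to the last </p>

# Hypothetical input where the paragraph body contains a newline.
multi = '<p class="drug-subtitle">line one\nline two</p>'
body = PATTERN.search(multi).group(1)
print(repr(body))
```

The lazy quantifier is what keeps each capture inside its own paragraph; with the greedy form the capture swallows everything up to the final closing tag.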

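So the problems in your regex are:

  1. There should be no ">" right after "<p": the opening tag has to stay open so that the class attribute can follow.
  2. In "</p>", the characters "<", "/" and ">" can be preceded by "\" to escape them. In Python's re module this is optional, because none of the three is a metacharacter.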
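A quick way to check the two points above, assuming Python 3's re module (the dictionary keys and variable names are only for illustration): the pattern from the question fails because of the stray ">", while the corrected pattern matches with or without the backslash escapes.

```python
import re

html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
        'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')

patterns = {
    # Pattern from the question: note the ">" directly after "<p".
    "original": '<p> class="drug-subtitle"[^>]*>(.*?)</p>',
    # Stray ">" removed, no escaping.
    "fixed": '<p class="drug-subtitle"[^>]*>(.*?)</p>',
    # Same pattern with the escapes used in the answer.
    "escaped": r'\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>',
}

results = {}
for name, pattern in patterns.items():
    m = re.match(pattern, html)
    results[name] = m.group(1) if m else None
    print(name, "->", "no match" if m is None else "match")
```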

Hope this helps.
