Matching an HTML tag with a regex in Python

str="<p class=\"drug-subtitle\"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"
br=re.match("<p> class=\"drug-subtitle\"[^>]*>(.*?)</p>",str)

3 Answers

User

#1 · edited 2024-09-29 17:19:24

I strongly recommend using a DOM parser library such as lxml, together with, for example, cssselect.

Example:

>>> from lxml.html import fromstring
>>> html = """<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"""
>>> doc = fromstring(html)
>>> "".join(filter(None, (e.text for e in doc.cssselect(".drug-subtitle")[0])))
'Generic Name:Brand Names:Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA'
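
If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a substitute technique, not the lxml approach shown above. The class name SubtitleText and its attributes are hypothetical, and unlike the lxml snippet it also collects the text that follows each child tag.

```python
from html.parser import HTMLParser


class SubtitleText(HTMLParser):
    """Collect the text inside the first <p class="drug-subtitle"> element."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are inside the target <p>
        self.done = False     # True once the target <p> has been closed
        self.chunks = []      # text fragments found inside the target <p>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "drug-subtitle" in classes and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


markup = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
          '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
          'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')

parser = SubtitleText()
parser.feed(markup)
parser.close()  # flush any buffered text
subtitle_text = "".join(parser.chunks)
print(subtitle_text)
```

The parser only tracks whether it is inside the target element, so it does not need to understand the nested b, br, and i tags, which is the main advantage over a regex here.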

User

#2 · edited 2024-09-29 17:19:24

Here is the fixed regex. The second line below marks the spot where your version went wrong. I used findall() so that all the matched groups are easy to print.

print(re.findall('<p class="drug-subtitle"[^>]*>(.*?)</p>', input))
                    ^ you had a > character here
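
To see the difference concretely, the sketch below runs both patterns against the sample markup from the question. The variable names (markup, broken, fixed) are illustrative, and the broken pattern is reconstructed from the note above about a stray > right after <p, so treat it as an assumption.

```python
import re

markup = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
          '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
          'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')

# Assumed original pattern: the stray ">" after "<p" requires the literal
# text "<p> class=...", which never occurs in the markup.
broken = r'<p> class="drug-subtitle"[^>]*>(.*?)</p>'

# Fixed pattern: "<p", a space, then the class attribute.
fixed = r'<p class="drug-subtitle"[^>]*>(.*?)</p>'

broken_matches = re.findall(broken, markup)
fixed_matches = re.findall(fixed, markup)

print(broken_matches)      # []
print(len(fixed_matches))  # 1
print(fixed_matches[0])    # inner HTML of the <p>, tags included
```

Note that the captured group is the inner HTML, so the b, br, and i tags are still present in the result.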

However, BeautifulSoup would be the simpler choice for this kind of task.

User

#3 · edited 2024-09-29 17:19:24

If you have the input:

'<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>'

and you want to check that a <p class="drug-subtitle"> element containing any text is present in the input, the regex to use is:

\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>

Explanation:

\< matches the character < literally
p matches the character p literally (case sensitive)
\s match any white space character [\r\n\t\f ]
class= matches the characters class= literally (case sensitive)
\" matches the character " literally
drug-subtitle matches the characters drug-subtitle literally (case sensitive)
\" matches the character " literally
[^>]* match a single character not present in the list below
    Quantifier: Between zero and unlimited times, as many times as possible,
               giving back as needed.
    > a single character in the list > literally (case sensitive)
> matches the character > literally
1st Capturing group (.*?)
    .*? matches any character (except newline)
        Quantifier: Between zero and unlimited times, as few times as possible,
                    expanding as needed.
\< matches the character < literally
\/ matches the character / literally
p matches the character p literally (case sensitive)
\> matches the character > literally
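
As a quick check of the pattern described above, the sketch below applies it with re.search, which returns a match object if the element is present and None otherwise. The helper name has_subtitle is hypothetical. Because . does not match a newline by default, one call shows a possible adjustment, re.DOTALL, for markup that spans several lines; that flag is an addition of this sketch, not part of the answer.

```python
import re

# The pattern from the answer, written as a raw string so the backslashes
# reach the regex engine unchanged.
PATTERN = r'\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>'


def has_subtitle(text, flags=0):
    """Return True if a <p class="drug-subtitle">...</p> element is found."""
    return re.search(PATTERN, text, flags) is not None


single_line = '<p class="drug-subtitle"><b>Generic Name:</b> albuterol</p>'
multi_line = '<p class="drug-subtitle">\n<b>Generic Name:</b> albuterol\n</p>'

print(has_subtitle(single_line))            # True
print(has_subtitle(multi_line))             # False: "." stops at newlines
print(has_subtitle(multi_line, re.DOTALL))  # True
print(has_subtitle('<p>no class</p>'))      # False
```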

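So the problems in your regex are:

There should be no ">" in "<p>".
In "</p>", the "<", "/", and ">" characters can be escaped by prefixing them with "\"; in Python's re module this is optional, since none of them is a metacharacter there.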
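
Whether or not these characters are escaped, Python's re module treats them as the same literals. The sketch below compares the two spellings on a shortened, made-up sample string; the variable names are illustrative.

```python
import re

markup = '<p class="drug-subtitle"><b>Generic Name:</b> albuterol</p>'

# Same pattern written with and without backslashes before <, ", / and >.
escaped = r'\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>'
unescaped = r'<p\sclass="drug-subtitle"[^>]*>(.*?)</p>'

escaped_groups = re.findall(escaped, markup)
unescaped_groups = re.findall(unescaped, markup)

print(escaped_groups == unescaped_groups)  # True
print(unescaped_groups)                    # ['<b>Generic Name:</b> albuterol']
```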

Hope this helps.
