如何使用Regex findall查找多个模式?

2024-10-02 00:21:27 发布

您现在位置:Python中文网/ 问答频道 /正文

我的任务是从一个字符串中获取“li”、“ul”标记,并对它们进行计数。 这是我尝试过的,它很有效,但正在寻找更好的解决方案

<ul><li>Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li>Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li>Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li>Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>,</ul>

代码:

liTag = re.findall('<li>',String)
ulTag = re.findall('<ul>',String)
count = len(liTag) + len(ulTag)

Tags: orandtheyouyouraswithsocial
1条回答
网友
1楼 · 发布于 2024-10-02 00:21:27

在您的示例中re是一个很好的解决方案,您不必搜索其他方法

最终你可以把它写成

tags = re.findall('<(li|ul)>', html)
print(len(tags))

但是如果您得到更复杂的标记,如<ul class="...">(或更复杂),那么regex将不起作用,更好(更容易)的方法是使用lxmlBeautifulSoup或其他HTML解析器

lxml:

import lxml.html

html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''

soup = lxml.html.fromstring(html)

li_tags = soup.xpath('//li')
ul_tags = soup.xpath('//ul')
count = len(li_tags) + len(ul_tags)

print(count)

你甚至可以试试

tags = soup.xpath('//ul|//li')
print(len(tags))

BeautifulSoup:

from bs4 import BeautifulSoup

html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

li_tags = soup.find_all('li')
ul_tags = soup.find_all('ul')
count = len(li_tags) + len(ul_tags)

print(count)

你甚至可以

tags = soup.find_all(('ul', 'li'))
print(len(tags))

编辑:在{}每个标记{},{}中,我添加了额外的信息-{},{},{},{},{},代码仍然可以正常工作,没有任何更改

对于regex,它需要'<(li|ul).*>''<(ul|li)'。但对于更复杂的事情,它需要更复杂的变化

相关问题 更多 >

    热门问题