如何使用Regex findall查找多个模式？

<ul><li>Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li> <li>Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li> <li>Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li> <li>Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>,</ul>

1条回答

网友

1楼 · 发布于 2024-10-02 00:21:27

在您的示例中re是一个很好的解决方案，您不必搜索其他方法

最终你可以把它写成

tags = re.findall('<(li|ul)>', html)
print(len(tags))

但是如果您得到更复杂的标记，如<ul class="...">（或更复杂），那么regex将不起作用，更好（更容易）的方法是使用lxml、BeautifulSoup或其他HTML解析器

lxml:

import lxml.html

html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''

soup = lxml.html.fromstring(html)

li_tags = soup.xpath('//li')
ul_tags = soup.xpath('//ul')
count = len(li_tags) + len(ul_tags)

print(count)

你甚至可以试试

tags = soup.xpath('//ul|//li')
print(len(tags))

BeautifulSoup:

from bs4 import BeautifulSoup

html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

li_tags = soup.find_all('li')
ul_tags = soup.find_all('ul')
count = len(li_tags) + len(ul_tags)

print(count)

你甚至可以

tags = soup.find_all(('ul', 'li'))
print(len(tags))

编辑：在{}每个标记{}，{}中，我添加了额外的信息-{}，{}，{}，{}，{}，代码仍然可以正常工作，没有任何更改

对于regex，它需要'<(li|ul).*>'或'<(ul|li)'。但对于更复杂的事情，它需要更复杂的变化

相关问题更多 >

编程相关推荐

热门问题

热门文章