如何编写一个正则表达式来提取无序列表和前面的段落

2024-10-06 11:18:50 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个漂亮的soup对象,我把它转换成了一个字符串,我想把项目符号列表的所有实例和它们前面的段落都拉出来。示例如下:

...
    <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>If your brand design does not reflect what you deliver</li>
    <li>If you want to attract a new target audience</li>
    <li>Management change</li>
    <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
    </ul>
...

我使用以下正则表达式:

re.findall('<p>.*</p>\n<ul>.*</ul>', string)

但是,它返回一个空列表。最好的办法是什么?你知道吗


Tags: toyou列表ifislibeul
2条回答

为什么您需要regex而beautifulsoup能够完全处理任何类型的html—最好尝试css selectors这里div.Mother div.Son ul li的意思是选择所有divs的类名Mother,然后在里面选择所有divs的类名Son,然后选择里面的ul,最后选择里面的所有li。你知道吗

from bs4 import BeautifulSoup as bs

data = """

    <body>
    <div class="Mother" >
        <div class="Son" >
            <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
            <ul>
                <li>You are experiencing a decrease in sales and customers</li>
                <li>If your brand design does not reflect what you deliver</li>
                <li>If you want to attract a new target audience</li>
                <li>Management change</li>
                <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
            </ul>
        </div>
    </div>
</body>

"""

soup = bs(data,'lxml')
#To grab all inside the ul
for item in soup.select('div.Mother div.Son'):
    print item.text.strip()
print  "="*100
#Just to grab all li    
for li in soup.select('div.Mother div.Son ul li'):
    print li.text.strip()

输出-

It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:

You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
====================================================================================================
You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding

不要使用正则表达式来解析HTML!

BeautifulSoup可以轻松、优雅、正确地完成您想要的一切:

>>> soup = bs4.BeautifulSoup(r"""
    <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>If your brand design does not reflect what you deliver</li>
    <li>If you want to attract a new target audience</li>
    <li>Management change</li>
    <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
    </ul>
""")
>>> bulleted_lists = soup.findAll('ul')
>>> uls_with_ps = [(ul.findPrevious('p'), ul) for ul in bulleted_lists]

要了解情况:

>>> bulleted_lists
[<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>]

>>> bulleted_lists[0].findPrevious('p')
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>

相关问题 更多 >