In Python, how can I write a single BeautifulSoup function that extracts post content from different blogs while web scraping?

Posted 2024-09-30 16:24:15


HTML of one post from the first blog:

<div class="entry-content"> <p>We are under the same sky.</p> <p>You and I.</p> <p>I share the soul of earth with you,</p> <p>to contribute a verse too.</p> <p>I have words to give,</p> <p>a smile to offer.</p> <p>You are at your right place.</p> <p>You live ,you stay ,you move ,you play.</p> <p>May also have works to do and words to say.</p> <p>We may cross each other or not.</p> <p>But the thing is, we are here,</p> <p>in this instant;So what, not so clear.</p> <p>But the powerful play goes on,</p> <p>for you may contribute a verse.</p> <div id="wordads-preview-parent" class="wpcnt"> <div class="wpa"> <span class="wpa-about">Advertisements</span> <div class="u"> <div class="wpa-notice"> <p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p> <p class="wpa-buttons"> <a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a> <a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a> </p> </div> </div> </div> </div>

HTML of one post from the second blog:

<div class="entry-content"> <h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2> <h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2> <h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2> <h2><span 
style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2> <h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2> <h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2> <h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2> <div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper 
jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div> </div><!-- .entry-content --> </div><!-- .entry-body -->

Please help me scrape the post content from this HTML with one approach that works for both posts and that I can also reuse for other blogs.


1 answer
User
#1 · Posted 2024-09-30 16:24:15

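The main problem is removing the unnecessary ads and banners. I wrote a simple function, scrap_data(): pass it the HTML string and it returns the scraped content: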

data_1 = """
<div class="entry-content">
        <p>We are under the same sky.</p>
<p>You and I.</p>
<p>I share the soul of earth with you,</p>
<p>to contribute a verse too.</p>
<p>I have words to give,</p>
<p>a smile to offer.</p>
<p>You are at your right place.</p>
<p>You live ,you stay ,you move ,you play.</p>
<p>May also have works to do and words to say.</p>
<p>We may cross each other or not.</p>
<p>But the thing is, we are here,</p>
<p>in this instant;So what, not so clear.</p>
<p>But the powerful play goes on,</p>
<p>for you may contribute a verse.</p>
        <div id="wordads-preview-parent" class="wpcnt">
            <div class="wpa">
                <span class="wpa-about">Advertisements</span>
                <div class="u">
                    <div class="wpa-notice">
                        <p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p>
                        <p class="wpa-buttons">
                            <a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a>
                            <a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a>
                        </p>
                    </div>
                </div>
            </div>
        </div>"""

data_2 = """
<div class="entry-content">
            <h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2>
<h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright  wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2>
<h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2>
<h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2>
<h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2>
<div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div 
class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div>        </div><!-- .entry-content -->
    </div><!-- .entry-body -->"""

from bs4 import BeautifulSoup

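def scrap_data(data):
    """Return the text of the .entry-content block without ad and share widgets."""
    soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser needs the lxml package installed
    # Remove the advertisement block and the sharing/like widgets
    for div in soup.select('div#wordads-preview-parent, div#jp-post-flair'):
        div.decompose()
    return soup.select_one('.entry-content').text.strip()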

print(scrap_data(data_1))
print('-' * 80)
print(scrap_data(data_2))
print('-' * 80)

Prints:

We are under the same sky.
You and I.
I share the soul of earth with you,
to contribute a verse too.
I have words to give,
a smile to offer.
You are at your right place.
You live ,you stay ,you move ,you play.
May also have works to do and words to say.
We may cross each other or not.
But the thing is, we are here,
in this instant;So what, not so clear.
But the powerful play goes on,
for you may contribute a verse.
--------------------------------------------------------------------------------
There are lessons which aren’t taught
Everything black isn’t always dark
Everything you love isn’t always desired
Everything you need isn’t always desired
Everything you look isn’t always watched
And everything you do isn’t always what u did.
REMEMBER!!!!!
--------------------------------------------------------------------------------
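
To reuse the same idea on other blogs, the selectors can be passed in as parameters instead of being hard-coded. The following is only a minimal sketch under some assumptions: the name extract_post_text, its parameters, and the DEFAULT_JUNK_SELECTORS tuple are hypothetical and not part of the answer above, and other blogs may use a different content container or different widget ids. It uses the built-in "html.parser" so that lxml is not required, and get_text() with a separator so that HTML delivered on a single line (as in the question) still comes out one paragraph per line.

```python
from bs4 import BeautifulSoup

# Hypothetical defaults: the two widget ids seen in the sample posts above.
DEFAULT_JUNK_SELECTORS = ("div#wordads-preview-parent", "div#jp-post-flair")


def extract_post_text(html, content_selector=".entry-content",
                      junk_selectors=DEFAULT_JUNK_SELECTORS):
    """Return the post text, or None if the content container is missing."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(content_selector)
    if content is None:
        return None
    # Drop every unwanted block found inside the content container.
    for selector in junk_selectors:
        for node in content.select(selector):
            node.decompose()
    # Join the remaining text nodes with newlines; whitespace-only nodes are skipped.
    return content.get_text(separator="\n", strip=True)


# Small single-line sample modelled on the first blog's HTML.
sample = (
    '<div class="entry-content"><p>You and I.</p> <p>a smile to offer.</p>'
    '<div id="wordads-preview-parent"><p>Advertisements</p></div></div>'
)
print(extract_post_text(sample))
```

For a blog with a different layout, pass its own content_selector and junk_selectors. Note that the separator is applied between all text nodes, so a paragraph that contains inline tags such as links will be split across several lines.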
