BeautifulSoup4 removes everything after </body> when parsing as "html" or "lxml"?

Posted 2024-04-24 03:00:16

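I'm using bs4 and the requests package with Python 3.8 on Anaconda, and I'm trying to get all the .tgz file names from voxforge.com. However, after I fetch the page with requests and turn it into a soup, all the information after </body> disappears.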

    import requests
    import bs4
    
    r = requests.get('http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/')
    r.text

This returns everything I need (and goes on for a while):

'<title>VoxForge Repository</title>\n\n\t<style type="text/css">\n\t.siteFunctions {\n\t\ttext-align: right;\n\t}\n\t.copyright {\n\t\ttest-align: left;\n\t\tcolor: #2E3436;\n\t\tfont-family: sans-serif;\n                font-size: small;\n\t}\n\n\tbody {\n\t\tfont-family: "DejaVu Sans", "Lucida Sans Unicode", sans-serif;\n\t\tfont-weight:\tnormal;\n\t\tword-spacing:\tnormal;\n\t\tletter-spacing:\tnormal; \n\t\ttext-transform:\tnone;\n\t\tfont-size: medium;\n                text-align: justify;\n\t}\n\th2 {\n\t\tfont-size:\t1.5em;\n\t\tfont-weight:\t700;\n\t\tmargin-top:1em;\n\t\tmargin-bottom:0.8em;\n\t}\n\th3 {\n\t\tfont-size:\t1.1em;\n\t\tfont-weight:\t600;\n\t\tmargin-top:1em;\n\t\tmargin-bottom:0.4em;\n\t}\n\tp, ol, ul {\n\t\tfont-size:\t1em;\n\t\tmargin-top:0.4em;\n\t\tmargin-bottom:0.4em;\n\t}\t\n\t.heading {\n\t\tbackground-color: #555753;\n                color: #D3D7CF;\n\t\tfont-size: 40px;\n\t\tvertical-align: bottom;\n\t}\n\t.logo {\n\t\twidth: 100px; \n\t\tfloat: left;\n\t\ttext-align: left;\n\t}\n\t.logo img {\n\t\tborder: 0px;\n\t}\n\timg {\n\t\tborder: 0px;\n\t}\n\t.clickableicons {\n\t}\n\t.endFloat {\n\t\tclear: both;\n\t\n\t}\n\t.padding {\n\t\tpadding: 10px;\n\t}\n\t.bodyContent {\n\t\tbackground-color: #ffffff;\n\t\tcolor: #2E3436;\n                text-align: justify;\n\t}\n\t.menu {\n                color: #D3D7CF;\n\t\tbackground-color: #555753;\n\t\ttext-align: left;\n\t}\n\n\t.menu2 {\n                color: #D3D7CF;\n\t\tbackground-color: #555753;\n\t\ttext-align: center;\n\t\t\n\t}\n\ta {\n\t\tcolor: #f57900;\n\t\ttext-decoration:none;\n\t}\n\ta:visited {\n\t\tcolor: #ce5c00;\n\t}\n\ta:hover {\n                text-decoration:underline;\n\t}\n\t.menu a {\n\t\tcolor: #D3D7CF;\n\t\tfont-weight: bold; \n\t}\n\t.menu a:hover {\n\t\tcolor: #eeeeec;\n\t\ttext-decoration:none;\n\t}\n\n\t</style>\n</head><body>\n\n\n\n<div class="heading">\n<div class="padding">\n<div class="logo"><a href="http://www.voxforge.org"><img 
src="http://www.voxforge.org/uploads/8k/N8/8kN884Cd96cmBZxRlzmbzQ/voxforge-logo.jpg" alt="VoxForge Repository"> </a></div> \n\n<div class="endFloat"></div>\n\n</div>\n</div>\n\n<div class="menu">\n\t<div class="padding">\t\t\n\t\t\n\t\t\n<span class="horizontalMenu">\n\n<a class="horizontalMenu" href="http://www.voxforge.org/home">Home</a>\n    &middot; \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/read">Read</a>\n    &middot; \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/listen">Listen</a>\n    &middot; \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/forums">Forums</a>\n    &middot; \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/dev">Dev</a>\n\n    &middot; \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/downloads">Downloads</a>\n    &middot; \n\n<a class="horizontalMenu" href="http://www.voxforge.org/home/about">About</a>\n \n\n  \n\n</span></div>\n\n</div>\n\n\n\n</div>\n\n</body></html>\n<pre><img src="/spicons/blank.gif" alt="Icon "> <a href="?C=N;O=D">Name</a>                                          <a href="?C=M;O=A">Last modified</a>      <a href="?C=S;O=A">Size</a>  <hr><img src="/spicons/back.gif" alt="[PARENTDIR]"> <a href="/downloads/fr/Trunk/Audio/Main/">Parent Directory</a>                                                   -   \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="4h-20100505-vgm.tgz">4h-20100505-vgm.tgz</a>                           2010-05-13 11:34  1.6M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-bfg.tgz">Agoniste-20130928-bfg.tgz</a>                     2014-02-17 05:02  1.8M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-fnn.tgz">Agoniste-20130928-fnn.tgz</a>                     2014-02-18 04:32  1.9M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-gaf.tgz">Agoniste-20130928-gaf.tgz</a>                     2014-02-18 04:32  2.0M  \n<img 
src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-izd.tgz">Agoniste-20130928-izd.tgz</a>                     2014-02-18 04:32  1.8M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-ndz.tgz">Agoniste-20130928-ndz.tgz</a>                     2014-02-18 04:32  1.8M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-pzq.tgz">Agoniste-20130928-pzq.tgz</a>                     2014-02-18 04:32  2.0M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-qyu.tgz">Agoniste-20130928-qyu.tgz</a>                     2014-02-18 04:32  2.1M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-rva.tgz">Agoniste-20130928-rva.tgz</a>                     2014-02-18 04:32  1.8M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Agoniste-20130928-vio.tgz">Agoniste-20130928-vio.tgz</a>                     2014-06-10 04:44  1.7M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Alliage-20151109-cyf.tgz">Alliage-20151109-cyf.tgz</a>                      2015-11-13 04:08  1.1M  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Alliage-20151109-dqh.tgz">Alliage-20151109-dqh.tgz</a>                      2015-11-13 04:08  960K  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Alliage-20151109-ewg.tgz">Alliage-20151109-ewg.tgz</a>                      2015-11-13 04:08  963K  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Alliage-20151109-imx.tgz">Alliage-20151109-imx.tgz</a>                      2015-11-13 04:08  855K  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Alliage-20151109-kny.tgz">Alliage-20151109-kny.tgz</a>                      2015-11-13 04:08  924K  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a href="Alliage-20151109-lcn.tgz">Alliage-20151109-lcn.tgz</a>                      2015-11-13 04:08  910K  \n<img src="/spicons/compressed.gif" alt="[   ]"> <a 
href="Alliage-20151109-rxi.tgz">Alliage-20151109-rxi.tgz</a>    

But when I use bs4 to parse it as html or lxml:

    soup = bs4.BeautifulSoup(r.text, 'html')
    soup

I only get back the first part; everything after </body> is gone:

   <html><head><title>VoxForge Repository</title>
<style type="text/css">
    .siteFunctions {
        text-align: right;
    }
    .copyright {
        test-align: left;
        color: #2E3436;
        font-family: sans-serif;
                font-size: small;
    }

    body {
        font-family: "DejaVu Sans", "Lucida Sans Unicode", sans-serif;
        font-weight:    normal;
        word-spacing:   normal;
        letter-spacing: normal; 
        text-transform: none;
        font-size: medium;
                text-align: justify;
    }
    h2 {
        font-size:  1.5em;
        font-weight:    700;
        margin-top:1em;
        margin-bottom:0.8em;
    }
    h3 {
        font-size:  1.1em;
        font-weight:    600;
        margin-top:1em;
        margin-bottom:0.4em;
    }
    p, ol, ul {
        font-size:  1em;
        margin-top:0.4em;
        margin-bottom:0.4em;
    }   
    .heading {
        background-color: #555753;
                color: #D3D7CF;
        font-size: 40px;
        vertical-align: bottom;
    }
    .logo {
        width: 100px; 
        float: left;
        text-align: left;
    }
    .logo img {
        border: 0px;
    }
    img {
        border: 0px;
    }
    .clickableicons {
    }
    .endFloat {
        clear: both;
    
    }
    .padding {
        padding: 10px;
    }
    .bodyContent {
        background-color: #ffffff;
        color: #2E3436;
                text-align: justify;
    }
    .menu {
                color: #D3D7CF;
        background-color: #555753;
        text-align: left;
    }

    .menu2 {
                color: #D3D7CF;
        background-color: #555753;
        text-align: center;
        
    }
    a {
        color: #f57900;
        text-decoration:none;
    }
    a:visited {
        color: #ce5c00;
    }
    a:hover {
                text-decoration:underline;
    }
    .menu a {
        color: #D3D7CF;
        font-weight: bold; 
    }
    .menu a:hover {
        color: #eeeeec;
        text-decoration:none;
    }

    </style>
</head><body>
<div class="heading">
<div class="padding">
<div class="logo"><a href="http://www.voxforge.org"><img alt="VoxForge Repository" src="http://www.voxforge.org/uploads/8k/N8/8kN884Cd96cmBZxRlzmbzQ/voxforge-logo.jpg"/> </a></div>
<div class="endFloat"></div>
</div>
</div>
<div class="menu">
<div class="padding">
<span class="horizontalMenu">
<a class="horizontalMenu" href="http://www.voxforge.org/home">Home</a>
    · 

<a class="horizontalMenu" href="http://www.voxforge.org/home/read">Read</a>
    · 

<a class="horizontalMenu" href="http://www.voxforge.org/home/listen">Listen</a>
    · 

<a class="horizontalMenu" href="http://www.voxforge.org/home/forums">Forums</a>
    · 

<a class="horizontalMenu" href="http://www.voxforge.org/home/dev">Dev</a>

    · 

<a class="horizontalMenu" href="http://www.voxforge.org/home/downloads">Downloads</a>
    · 

<a class="horizontalMenu" href="http://www.voxforge.org/home/about">About</a>
</span></div>
</div>
</body></html>

I'm trying to get the file names that come after </body>, so I need a way to extract them, but bs4 seems to be removing them. Can anyone help?


2 Answers

Another solution, using html.parser instead of 'html' or 'lxml':

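    import bs4
    import requests

    r = requests.get('http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/')
    # html.parser keeps the markup that follows </html>, where the file listing lives
    soup = bs4.BeautifulSoup(r.content, 'html.parser')

    # Select every link whose href contains ".tgz"
    for a in soup.select('a[href*=".tgz"]'):
        print(a['href'])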

Prints:

4h-20100505-vgm.tgz
Agoniste-20130928-bfg.tgz
Agoniste-20130928-fnn.tgz
Agoniste-20130928-gaf.tgz
Agoniste-20130928-izd.tgz
Agoniste-20130928-ndz.tgz
Agoniste-20130928-pzq.tgz
Agoniste-20130928-qyu.tgz
Agoniste-20130928-rva.tgz

...and so on.
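
To check this without hitting the network, here is a minimal sketch that runs the same kind of query on a small hand-written sample shaped like the page (the SAMPLE string and the count_tgz_links helper are made up for illustration, not taken from the site). It tries each parser that happens to be installed and counts the .tgz links it finds. It uses the ends-with selector `$=` rather than `*=`, which is a slightly stricter match. Only the html.parser result is something I would rely on here; what lxml and html5lib return for markup after </html> may depend on the installed versions.

```python
import bs4

# Hand-written sample: the file listing sits after </body></html>,
# as on the page in the question. Not real site data.
SAMPLE = (
    '<html><head><title>VoxForge Repository</title></head><body>\n'
    '<div class="menu"><a href="http://www.voxforge.org/home">Home</a></div>\n'
    '</body></html>\n'
    '<pre><img src="/spicons/back.gif" alt="[PARENTDIR]"> '
    '<a href="/downloads/fr/Trunk/Audio/Main/">Parent Directory</a>\n'
    '<img src="/spicons/compressed.gif" alt="[   ]"> '
    '<a href="4h-20100505-vgm.tgz">4h-20100505-vgm.tgz</a>\n'
    '<img src="/spicons/compressed.gif" alt="[   ]"> '
    '<a href="Agoniste-20130928-bfg.tgz">Agoniste-20130928-bfg.tgz</a>\n'
    '</pre>'
)


def count_tgz_links(html, parser):
    """Parse html with the given parser and count links ending in .tgz."""
    soup = bs4.BeautifulSoup(html, parser)
    return len(soup.select('a[href$=".tgz"]'))


results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = count_tgz_links(SAMPLE, parser)
    except bs4.FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        results[parser] = None
    print(parser, results[parser])
```

If a parser reports fewer links than html.parser on your machine, that is the same truncation the question describes.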

Try:

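    import requests
    import bs4

    r = requests.get('http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/')

    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    # Each listing row is an icon <img> followed by a link; the first match is
    # the "Parent Directory" link, so skip it with [1:]
    for link in soup.select('a~ img+ a')[1:]:
        print(link.get('href'))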

Prints:

4h-20100505-vgm.tgz
Agoniste-20130928-bfg.tgz
Agoniste-20130928-fnn.tgz
Agoniste-20130928-gaf.tgz
Agoniste-20130928-izd.tgz
Agoniste-20130928-ndz.tgz
Agoniste-20130928-pzq.tgz
Agoniste-20130928-qyu.tgz
Agoniste-20130928-rva.tgz
Agoniste-20130928-vio.tgz
Alliage-20151109-cyf.tgz

...and so on.
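
The selector is a bit cryptic, so here is a small offline sketch of what it matches. The LISTING string is a hand-written fragment modeled on the <pre> block from the question, not real site data. As far as I can tell, 'a ~ img + a' picks every link that directly follows an <img> which itself comes after some earlier link, so the sort-header links are skipped and the first hit is "Parent Directory".

```python
from bs4 import BeautifulSoup

# Hand-written fragment shaped like the directory listing in the question
LISTING = (
    '<pre><img src="/spicons/blank.gif" alt="Icon "> '
    '<a href="?C=N;O=D">Name</a> '
    '<a href="?C=M;O=A">Last modified</a> '
    '<a href="?C=S;O=A">Size</a> <hr>'
    '<img src="/spicons/back.gif" alt="[PARENTDIR]"> '
    '<a href="/downloads/fr/Trunk/Audio/Main/">Parent Directory</a>\n'
    '<img src="/spicons/compressed.gif" alt="[   ]"> '
    '<a href="4h-20100505-vgm.tgz">4h-20100505-vgm.tgz</a>\n'
    '<img src="/spicons/compressed.gif" alt="[   ]"> '
    '<a href="Agoniste-20130928-bfg.tgz">Agoniste-20130928-bfg.tgz</a>\n'
    '</pre>'
)

soup = BeautifulSoup(LISTING, 'html.parser')

# All links matched by the structural selector, in document order
matched = [a['href'] for a in soup.select('a ~ img + a')]

# Dropping the first match removes the "Parent Directory" entry
files = matched[1:]

print(matched)
print(files)
```

Because this depends on the exact layout of the listing, the attribute selector from the other answer is probably the more robust choice if the page layout ever changes.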
