Scraping data when the elements lack a class/id

Posted 2024-05-18 15:47:18


I'm trying to scrape data to build an object that looks like this:

{
    "artist": "Oasis",
    "albums": {
        "Definitely Maybe": [
            "Rock n Roll Star",
            "Shakermaker",
            ...
        ],
        "(What's The Story) Morning Glory": [
            "Hello",
            "Roll With It",
            ...
        ],
        ...
    }
}

The HTML of the page is included in the edit at the end of this question.


I'm currently scraping the song names like this:

data = []
for div in soup.find_all("div", {"id": "listAlbum"}):
    for a in div.find_all("a"):
        text = a.text.strip()
        # Skip empty anchors such as <a id="1368"></a>.
        # (The original compared with `is ""`, which tests identity, not equality.)
        if text:
            data.append(text)

Similarly, getting the album names is straightforward:

for div in soup.find_all("div",{"class":"album"}):
    titles = div.find_all('b')
    for t in titles:
        ...

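My question is how to use the two loops above to build an object like the one at the top. How can I make sure that the songs from album X end up under the correct album? If every song had an album attribute it would be obvious, but given how the HTML is structured, I'm a bit lost.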

Edit: here is the HTML.

<div id="listAlbum"> <a id="1368"></a> <div class="album">album: <b>"Definitely Maybe"</b> (1994)</div> <a href="../lyrics/oasis/rocknrollstar.html" target="_blank">Rock 'n' Roll Star</a><br> <a href="../lyrics/oasis/shakermaker.html" target="_blank">Shakermaker</a><br> <a href="../lyrics/oasis/liveforever.html" target="_blank">Live Forever</a><br> <a href="../lyrics/oasis/upinthesky.html" target="_blank">Up In The Sky</a><br> <a href="../lyrics/oasis/columbia.html" target="_blank">Columbia</a><br> <a href="../lyrics/oasis/supersonic.html" target="_blank">Supersonic</a><br> <a href="../lyrics/oasis/bringitondown.html" target="_blank">Bring It On Down</a><br> <a href="../lyrics/oasis/cigarettesalcohol.html" target="_blank">Cigarettes &amp; Alcohol</a><br> <a href="../lyrics/oasis/digsysdiner.html" target="_blank">Digsy's Diner</a><br> <a href="../lyrics/oasis/slideaway.html" target="_blank">Slide Away</a><br> <a href="../lyrics/oasis/marriedwithchildren.html" target="_blank">Married With Children</a><br> <a href="../lyrics/oasis/sadsong.html" target="_blank">Sad Song</a><br> <a id="1366"></a> <div class="album">album: <b>"(What's The Story) Morning Glory"</b> (1995)</div> <a href="../lyrics/oasis/hello.html" target="_blank">Hello</a><br> <a href="../lyrics/oasis/rollwithit.html" target="_blank">Roll With It</a><br> <a href="../lyrics/oasis/wonderwall.html" target="_blank">Wonderwall</a><br> <a href="../lyrics/oasis/dontlookbackinanger.html" target="_blank">Don't Look Back In Anger</a><br> <a href="../lyrics/oasis/heynow.html" target="_blank">Hey Now</a><br> <a href="../lyrics/oasis/somemightsay.html" target="_blank">Some Might Say</a><br> <a href="../lyrics/oasis/castnoshadow.html" target="_blank">Cast No Shadow</a><br> <a href="../lyrics/oasis/sheselectric.html" target="_blank">She's Electric</a><br> <a href="../lyrics/oasis/morningglory.html" target="_blank">Morning Glory</a><br> <a href="../lyrics/oasis/champagnesupernova.html" target="_blank">Champagne 
Supernova</a><br> <a href="../lyrics/oasis/boneheadsbankholiday.html" target="_blank">Bonehead's Bank Holiday</a><br>

1 Answer
User
#1 · Posted 2024-05-18 15:47:18

You can use find_next_siblings() to do this.

Code:

from bs4 import BeautifulSoup

oasis = {
    'artist': 'Oasis',
    'albums': {}
}

soup = BeautifulSoup(html, 'lxml')  # where html is the html you've provided
all_albums = soup.find('div', id='listAlbum')

first_album = all_albums.find('div', class_='album')
album_name = first_album.b.text
songs = []

for tag in first_album.find_next_siblings(['a', 'div']):
    # If tag is <div> add the previous album.
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        album_name = tag.b.text

    # If tag is <a> append song to the list.
    else:
        songs.append(tag.text)

# Add the last album
oasis['albums'][album_name] = songs

print(oasis)

Output:

{
    'artist': 'Oasis', 
    'albums': {
        '"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song', ''], 
        '"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"]
    }
}

Edit:

After checking the website, I made a few changes to the code.

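First, you need to skip tags like <a id="6910"></a> (one sits at the end of each album), because each of them would add a song with an empty name. Second, the text other songs: is not inside a <b> tag, so tag.b is None there and album_name = tag.b.text raises an AttributeError.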
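Both problems can be reproduced on a small fragment. The sketch below is only an illustration: the HTML string is a trimmed, hand-written stand-in for the real page (the href="#" link and the exact markup of the other songs: header are assumptions), and it uses Python's built-in html.parser instead of lxml so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page, not the real markup in full.
# The "other songs:" header markup and the href="#" link are assumptions.
html = """
<div id="listAlbum">
<div class="album">album: <b>"Definitely Maybe"</b> (1994)</div>
<a href="../lyrics/oasis/sadsong.html" target="_blank">Sad Song</a><br>
<a id="6910"></a>
<div class="album">other songs:</div>
<a href="#">Alice</a><br>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="listAlbum")

anchors = container.find_all("a")
# Anchors that carry an id are empty markers, not songs.
marker_flags = [bool(a.get("id")) for a in anchors]
song_names = [a.text for a in anchors if not a.get("id")]

headers = container.find_all("div", class_="album")
# The "other songs:" header has no <b> child, so .b is None there.
has_bold = [h.b is not None for h in headers]

print(marker_flags)
print(song_names)
print(has_bold)
```

Checking tag.get('id') separates the marker anchors from the song links, and checking tag.b against None shows which headers can safely use tag.b.text.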

Making the following changes will do exactly what you need.

for tag in first_album.find_next_siblings(['a', 'div']):
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        album_name = tag.text if tag.text == 'other songs:' else tag.b.text
        continue
    if tag.get('id'):
        continue
    songs.append(tag.text)

Final output:

{
    'artist': 'Oasis', 
    'albums': {
        '"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song'], 
        '"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"], 
        '"Be Here Now"': ["D'You Know What I Mean?", 'My Big Mouth', 'Magic Pie', 'Stand By Me', 'I Hope, I Think, I Know', 'The Girl In The Dirty Shirt', 'Fade In-Out', "Don't Go Away", 'Be Here Now', 'All Around The World', "It's Getting Better (Man!!)"], 
        '"The Masterplan"': ['Acquiesce', 'Underneath The Sky', 'Talk Tonight', 'Going Nowhere', 'Fade Away', 'I Am The Walrus (Live)', 'Listen Up', "Rockin' Chair", 'Half The World Away', "(It's Good) To Be Free", 'Stay Young', 'Headshrinker', 'The Masterplan'], 
        '"Standing On The Shoulder Of Giants"': ["Fuckin' In The Bushes", 'Go Let It Out', 'Who Feels Love?', 'Put Yer Money Where Yer Mouth Is', 'Little James', 'Gas Panic!', 'Where Did It All Go Wrong?', 'Sunday Morning Call', 'I Can See A Liar', 'Roll It Over'], 
        '"Heathen Chemistry"': ['The Hindu Times', 'Force Of Nature', 'Hung In A Bad Place', 'Stop Crying Your Heart Out', 'Song Bird', 'Little By Little', '(Probably) All In The Mind', 'She Is Love', 'Born On A Different Cloud', 'Better Man'], 
        '"Don\'t Believe The Truth"': ['Turn Up The Sun', 'Mucky Fingers', 'Lyla', 'Love Like A Bomb', 'The Importance Of Being Idle', 'The Meaning Of Soul', "Guess God Thinks I'm Abel", 'Part Of The Queue', 'Keep The Dream Alive', 'A Bell Will Ring', 'Let There Be Love'], 
        '"Dig Out Your Soul"': ['Bag It Up', 'The Turning', 'Waiting For The Rapture', 'The Shock Of The Lightning', "I'm Outta Time", '(Get Off Your) High Horse Lady', 'Falling Down', "To Be Where There's Life", "Ain't Got Nothin'", 'The Nature Of Reality', 'Soldier On', 'I Believe In All'], 
        'other songs:': ["(As Long As They've Got) Cigarettes In Hell", '(I Got) The Fever', 'Alice', 'Alive', 'Angel Child', 'Boy With The Blues', 'Carry Us All', 'Cloudburst', 'Cum On Feel The Noize', "D'Yer Wanna Be A Spaceman", 'Eyeball Tickler', 'Flashbax', 'Full On', 'Helter Skelter', 'Heroes', 'I Will Believe', "Idler's Dream", 'If We Shadows', "It's Better People", 'Just Getting Older', "Let's All Make Believe", 'My Sister Lover', 'One Way Road', 'Round Are Way', 'Step Out', 'Street Fighting Man', 'Take Me', 'Take Me Away', 'The Fame', 'Whatever', "You've Got To Hide Your Love Away"]
    }
}
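
The same idea can also be wrapped in a reusable function. The sketch below is one possible variant, not the code used above: the name albums_from_html is made up for illustration, it uses the built-in html.parser rather than lxml, it walks the direct children of the listAlbum div instead of calling find_next_siblings(), it falls back to the header's own text whenever there is no <b> (rather than comparing against 'other songs:'), and it strips the literal double quotes from album names so the keys match the shape asked for in the question. The sample HTML is a trimmed stand-in for the real page, and the markup of the other songs: header is an assumption.

```python
import json

from bs4 import BeautifulSoup


def albums_from_html(html, artist):
    """Group song links under the album header that precedes them."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="listAlbum")
    albums = {}
    songs = None  # no album header seen yet

    # Only direct children matter: headers (<div>) and links (<a>).
    for tag in container.find_all(["a", "div"], recursive=False):
        if tag.name == "div":
            # A header starts a new album; prefer the <b> text if present.
            source = tag.b if tag.b is not None else tag
            name = source.get_text(strip=True).strip('"')
            songs = albums.setdefault(name, [])
        elif songs is not None and not tag.get("id"):
            # Skip id-only marker anchors; keep real song links.
            songs.append(tag.get_text(strip=True))

    return {"artist": artist, "albums": albums}


# Trimmed stand-in for the page; the last header and link are assumptions.
sample = """
<div id="listAlbum">
<a id="1368"></a>
<div class="album">album: <b>"Definitely Maybe"</b> (1994)</div>
<a href="../lyrics/oasis/rocknrollstar.html" target="_blank">Rock 'n' Roll Star</a><br>
<a href="../lyrics/oasis/shakermaker.html" target="_blank">Shakermaker</a><br>
<a id="1366"></a>
<div class="album">album: <b>"(What's The Story) Morning Glory"</b> (1995)</div>
<a href="../lyrics/oasis/hello.html" target="_blank">Hello</a><br>
<a id="6910"></a>
<div class="album">other songs:</div>
<a href="#">Alice</a><br>
</div>
"""

result = albums_from_html(sample, "Oasis")
print(json.dumps(result, indent=4))
```

Because the function returns a plain dict of lists, json.dumps can serialize it directly. Stripping the quotes is optional; drop the .strip('"') call to keep the album names exactly as they appear on the page.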
