Scraping data when the elements have no class/id

{
    "artist": "Oasis",
    "albums": {
        "Definitely Maybe": [ "Rock n Roll Star", "Shakermaker", ... ],
        "(What's The Story) Morning Glory": [ "Hello", "Roll With It" ... ],
        ...
    }
}

data = []
for div in soup.find_all("div", {"id": "listAlbum"}):
    links = div.find_all('a')
    for a in links:
        text = a.text.strip()
        # Skip anchors with no visible text (compare by truthiness, not with `is ""`).
        if text:
            data.append(text)

1 answer

User

#1 · Posted on 2024-05-18 15:47:18

You can use find_next_siblings() to achieve this.

Code:

from bs4 import BeautifulSoup

oasis = {
    'artist': 'Oasis',
    'albums': {}
}

soup = BeautifulSoup(html, 'lxml')  # where html is the html you've provided
all_albums = soup.find('div', id='listAlbum')

first_album = all_albums.find('div', class_='album')
album_name = first_album.b.text
songs = []

for tag in first_album.find_next_siblings(['a', 'div']):
    # If tag is <div> add the previous album.
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        album_name = tag.b.text

    # If tag is <a> append song to the list.
    else:
        songs.append(tag.text)

# Add the last album
oasis['albums'][album_name] = songs

print(oasis)

Output:

{
    'artist': 'Oasis', 
    'albums': {
        '"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song', ''], 
        '"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"]
    }
}
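
To see what find_next_siblings() hands back to the loop, here is a minimal sketch on a made-up fragment. The markup is an assumption modelled on the structure this answer describes (album headers as <div class="album">, songs as bare <a> tags), not the real page, and html.parser is used instead of lxml only so the snippet needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: album headers are <div class="album">, songs are bare <a> tags.
html = """
<div id="listAlbum">
<div class="album">album: <b>"First"</b> (1994)</div>
<a href="#">Song A</a><br>
<a href="#">Song B</a><br>
<div class="album">album: <b>"Second"</b> (1995)</div>
<a href="#">Song C</a><br>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find("div", id="listAlbum").find("div", class_="album")

# Siblings come back in document order; passing a list of names keeps only
# <a> and <div> tags, so the <br> tags and bare text nodes are dropped.
siblings = [(tag.name, tag.get_text()) for tag in first.find_next_siblings(["a", "div"])]
print(siblings)
```

Because the songs and the next album header are all siblings of the first header, a single pass over this list is enough: each <div> marks the start of a new album, and each <a> belongs to the most recent one.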

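Edit:

After checking the website, I made some changes to the code.

First, you need to skip the <a id="6910"></a> tag (found at the end of each album), because it would add a song with an empty name. Second, the text other songs: is not inside a <b> tag, so tag.b is None and album_name = tag.b.text raises an AttributeError.

Making the following changes gives you exactly what you need.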

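for tag in first_album.find_next_siblings(['a', 'div']):
    # A <div> starts a new album: store the previous one first.
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        # 'other songs:' has no <b> tag, so use the div text directly.
        album_name = tag.text if tag.text == 'other songs:' else tag.b.text
        continue
    # Skip the empty <a id="..."></a> marker at the end of each album.
    if tag.get('id'):
        continue
    songs.append(tag.text)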

Final output:

{
    'artist': 'Oasis', 
    'albums': {
        '"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song'], 
        '"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"], 
        '"Be Here Now"': ["D'You Know What I Mean?", 'My Big Mouth', 'Magic Pie', 'Stand By Me', 'I Hope, I Think, I Know', 'The Girl In The Dirty Shirt', 'Fade In-Out', "Don't Go Away", 'Be Here Now', 'All Around The World', "It's Getting Better (Man!!)"], 
        '"The Masterplan"': ['Acquiesce', 'Underneath The Sky', 'Talk Tonight', 'Going Nowhere', 'Fade Away', 'I Am The Walrus (Live)', 'Listen Up', "Rockin' Chair", 'Half The World Away', "(It's Good) To Be Free", 'Stay Young', 'Headshrinker', 'The Masterplan'], 
        '"Standing On The Shoulder Of Giants"': ["Fuckin' In The Bushes", 'Go Let It Out', 'Who Feels Love?', 'Put Yer Money Where Yer Mouth Is', 'Little James', 'Gas Panic!', 'Where Did It All Go Wrong?', 'Sunday Morning Call', 'I Can See A Liar', 'Roll It Over'], 
        '"Heathen Chemistry"': ['The Hindu Times', 'Force Of Nature', 'Hung In A Bad Place', 'Stop Crying Your Heart Out', 'Song Bird', 'Little By Little', '(Probably) All In The Mind', 'She Is Love', 'Born On A Different Cloud', 'Better Man'], 
        '"Don\'t Believe The Truth"': ['Turn Up The Sun', 'Mucky Fingers', 'Lyla', 'Love Like A Bomb', 'The Importance Of Being Idle', 'The Meaning Of Soul', "Guess God Thinks I'm Abel", 'Part Of The Queue', 'Keep The Dream Alive', 'A Bell Will Ring', 'Let There Be Love'], 
        '"Dig Out Your Soul"': ['Bag It Up', 'The Turning', 'Waiting For The Rapture', 'The Shock Of The Lightning', "I'm Outta Time", '(Get Off Your) High Horse Lady', 'Falling Down', "To Be Where There's Life", "Ain't Got Nothin'", 'The Nature Of Reality', 'Soldier On', 'I Believe In All'], 
        'other songs:': ["(As Long As They've Got) Cigarettes In Hell", '(I Got) The Fever', 'Alice', 'Alive', 'Angel Child', 'Boy With The Blues', 'Carry Us All', 'Cloudburst', 'Cum On Feel The Noize', "D'Yer Wanna Be A Spaceman", 'Eyeball Tickler', 'Flashbax', 'Full On', 'Helter Skelter', 'Heroes', 'I Will Believe', "Idler's Dream", 'If We Shadows', "It's Better People", 'Just Getting Older', "Let's All Make Believe", 'My Sister Lover', 'One Way Road', 'Round Are Way', 'Step Out', 'Street Fighting Man', 'Take Me', 'Take Me Away', 'The Fame', 'Whatever', "You've Got To Hide Your Love Away"]
    }
}
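
The edited loop can also be wrapped in a function and tried on a small sample that contains both edge cases. This is only a sketch under assumptions: parse_albums, name_of and SAMPLE are hypothetical names, the sample markup is invented to mimic the page, and html.parser stands in for lxml. It also falls back to the <div> text whenever a header has no <b> tag, which is a slight variation on comparing against 'other songs:' directly.

```python
from bs4 import BeautifulSoup


def parse_albums(html, artist):
    """Group <a> song links under the album <div> that precedes them."""
    result = {"artist": artist, "albums": {}}
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="listAlbum")
    first_album = container.find("div", class_="album") if container else None
    if first_album is None:
        return result

    def name_of(div):
        # Use the <b> text when present, otherwise the whole div text.
        return div.b.get_text() if div.b else div.get_text(strip=True)

    album_name = name_of(first_album)
    songs = []
    for tag in first_album.find_next_siblings(["a", "div"]):
        if tag.name == "div":
            # A new header: store the album collected so far.
            result["albums"][album_name] = songs
            songs = []
            album_name = name_of(tag)
            continue
        if tag.get("id"):
            # Empty marker anchor such as <a id="6910"></a>.
            continue
        songs.append(tag.get_text())

    # Store the last album.
    result["albums"][album_name] = songs
    return result


# Invented sample with an id-only anchor and a header without <b>.
SAMPLE = """
<div id="listAlbum">
<div class="album">album: <b>"First"</b> (1994)</div>
<a href="#">Song A</a><br>
<a id="6910"></a>
<div class="album">other songs:</div>
<a href="#">Song B</a><br>
</div>
"""

print(parse_albums(SAMPLE, "Demo"))
```

Returning a fresh dictionary instead of mutating a module-level one makes it easier to reuse the function for other artists whose pages follow the same layout.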
