Python: web scraping, navigating tags in a Wikipedia infobox table

Published 2024-06-18 13:01:49


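I am trying to use bs4 to isolate the "Career history" part (the list of teams a player has played for) of the infobox table on an NFL quarterback's Wikipedia page: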

My desired output is:

['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)']

My code is:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Ryan_Fitzpatrick'
player_wiki = requests.get(url)
table = BeautifulSoup(player_wiki.text, 'html.parser')

for tr in table.find('tbody').find_all('ul'):
    v = [li.text for li in tr.find_all('li')]
    print(v)

Current output:

['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)']
['Ivy League Player of the Year (2004)', 'First-team All–Ivy League (2004)', 'George H. “Bulger” Lowe Award (2004)']

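I'm fairly sure the cause is the 'ul' tag in my outer loop, which matches every list in the table. How can I narrow the scope of find_all() to keep out the unwanted data? Any suggestions? I'm new to web scraping.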


2 answers

You can use find_all, scoped to the infobox table:

import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/Ryan_Fitzpatrick').text, 'html.parser')
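# The row index 12 matched the career-history row when this was written; it may shift if the page changes.
result = [i.get_text(strip=True) for i in d.find('table', {'class':'infobox vcard'}).find_all('tr')[12].find_all('li')]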

Output:

['St. Louis Rams(2005–2006)', 'Cincinnati Bengals(2007–2008)', 'Buffalo Bills(2009–2012)', 'Tennessee Titans(2013)', 'Houston Texans(2014)', 'New York Jets(2015–2016)', 'Tampa Bay Buccaneers(2017–2018)', 'Miami Dolphins(2019–present)']
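A hard-coded row index such as [12] depends on the current layout of the page. As a rough sketch of a less position-dependent alternative, the snippet below looks up the header cell by its visible text and then takes the first list that follows it. The SAMPLE_HTML markup and the career_history helper are hypothetical stand-ins written for illustration; the assumption that the live infobox has a th cell reading "Career history" followed by a ul should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the infobox markup (an assumption,
# not copied from the live Wikipedia page).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th colspan="2">Career history</th></tr>
  <tr><td colspan="2"><ul>
    <li><a href="/wiki/St._Louis_Rams">St. Louis Rams</a> (2005–2006)</li>
    <li><a href="/wiki/Cincinnati_Bengals">Cincinnati Bengals</a> (2007–2008)</li>
  </ul></td></tr>
  <tr><th colspan="2">Career highlights and awards</th></tr>
  <tr><td colspan="2"><ul>
    <li>Ivy League Player of the Year (2004)</li>
  </ul></td></tr>
</table>
"""


def career_history(html):
    """Return the list items that follow the 'Career history' header cell."""
    page = BeautifulSoup(html, "html.parser")
    infobox = page.find("table", class_="infobox")
    if infobox is None:
        return []
    # Locate the header by its text instead of by its row position.
    header = next(
        (th for th in infobox.find_all("th")
         if th.get_text(strip=True).lower().startswith("career history")),
        None,
    )
    if header is None:
        return []
    # find_next() searches forward in document order, so this is the
    # first <ul> that appears after the header cell.
    items = header.find_next("ul")
    if items is None:
        return []
    # A " " separator keeps the space between the team link and the years.
    return [li.get_text(" ", strip=True) for li in items.find_all("li")]


print(career_history(SAMPLE_HTML))
```

Passing a separator to get_text also avoids the missing space seen in the output above ('St. Louis Rams(2005–2006)'), which comes from strip=True joining the link text and the years with nothing in between.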

Method 1 - using requests and beautifulsoup4

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://en.wikipedia.org/wiki/Ryan_Fitzpatrick')
    soup = BeautifulSoup(r.text, 'html.parser')

    for item in soup.find('tbody').find_all('ul'):
        for href in item.find_all('a'):
            print(href.get_text())

Method 2 - using the wikipedia module:

    from bs4 import BeautifulSoup
    import wikipedia

    ry = wikipedia.page('Ryan_Fitzpatrick')
    soup = BeautifulSoup(ry.html(), 'html.parser')
    career_history = []
    for tr in soup.find('tbody').find_all('ul'):
        for li in tr.find_all('li'):
            career_history.append(li.text)

    print(career_history)

Output:

['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)', 'Ivy League Player of the Year (2004)', 'First-team All–Ivy League (2004)', 'George H. “Bulger” Lowe Award (2004)']
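The output above still contains the three award entries, because find_all('ul') visits every list inside the first tbody. One possible way to narrow it, sketched below, is to take only the first ul, either with find() or with find_all(..., limit=1). This assumes the career-history list is the first ul in that table, which is what the question's output suggests; SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup (an assumption for illustration only).
SAMPLE_HTML = """
<table><tbody>
  <tr><td><ul>
    <li>Tennessee Titans (2013)</li>
    <li>Houston Texans (2014)</li>
  </ul></td></tr>
  <tr><td><ul>
    <li>Ivy League Player of the Year (2004)</li>
  </ul></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching tag, so the awards list is skipped.
first_list = soup.find("tbody").find("ul")
career_history = [li.text for li in first_list.find_all("li")]

# Same result with find_all(), capped at one match via limit=1.
career_history_limited = [
    li.text
    for ul in soup.find("tbody").find_all("ul", limit=1)
    for li in ul.find_all("li")
]

print(career_history)
print(career_history_limited)
```

If the page ever places another list before the career history, this positional approach would pick up the wrong one, so matching on the section header text is the safer choice when that matters.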
