How do I scrape data from a specific part of a table when all the classes are the same?

Posted 2024-10-02 04:30:58


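I'm trying to scrape data from this website, which has a table of game credits split into different categories. There are 24 categories in total, and I'd like to sort them into 24 columns. The example page "https://www.mobygames.com/developer/sheet/view/developerId,1/" has 5 of them (Production, Design, Programming/Engineering, Support and Thanks).

This would be easy if the rows in different sections had different classes, but they all share the same tr class: "devCreditsHighlight". Different pages have different sections, and the order changes from page to page. On top of that, the information I need starts on the row after each heading, and the number of rows varies.

My question is:

Is there a way to scrape a specific part of a table that contains a particular keyword? For example, start scraping when you reach the text "Business", and stop when you reach ^{cl 1}$

Here is my code: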

import bs4 as bs
import urllib.request


gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"

req = urllib.request.Request(gameurl,headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
infopage = soup.find_all("div", {"class":"col-md-8 col-lg-8"})
core_list =[]

for credits in infopage:
        niceHeaderTitle = credits.find_all("h1", {"class":"niceHeaderTitle"})
        name = niceHeaderTitle[0].text

        Titles = credits.find_all("h3", {"class":"clean"})

        Titles = [title.get_text() for title in Titles]

        if 'Business' in Titles:

            businessinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            business = businessinfo[0].get_text(strip=True)


        else:
            business = 'none'


        if 'Production' in Titles:

            productioninfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            production = productioninfo[0].get_text(strip=True)


        else:
            production = 'none'

        if 'Design' in Titles:

            designinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            design = designinfo[0].get_text(strip=True)


        else:
            design = 'none'

        if 'Writers' in Titles:

            writersinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            writers = writersinfo[0].get_text(strip=True)


        else:
            writers = 'none'            

        if 'Writers' in Titles:

            writersinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            writers = writersinfo[0].get_text(strip=True)


        else:
            writers = 'none'

        if 'Programming/Engineering' in Titles:

            programinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            program = programinfo[0].get_text(strip=True)


        else:
            program = 'none'

        if 'Video/Cinematics' in Titles:

            videoinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            video = videoinfo[0].get_text(strip=True)


        else:
            video = 'none'   

        if 'Audio' in Titles:

            Audioinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            audio = Audioinfo[0].get_text(strip=True)


        else:
            audio = 'none' 

        if 'Art/Graphics' in Titles:

            artinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            art = artinfo[0].get_text(strip=True)


        else:
            art = 'none'             


        if 'Support' in Titles:

            supportinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            support = supportinfo[0].get_text(strip=True)


        else:
            support = 'none' 

        if 'Thanks' in Titles:

            thanksinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            thanks = thanksinfo[0].get_text(strip=True)


        else:
            thanks = 'none'             

        games=[name,business,production,design,writers,video,audio,art,support,program,thanks]

        core_list.append(games)            

print (core_list)

This is the page's HTML:

</div>
</div></div>
<div class="row">
<div class="col-md-8 col-lg-8" style="overflow: hidden;">
<h1 class="niceHeaderTitle"><div class="pull-right"><a href="https://www.mobygames.com/developer/sheet/contribute/developerId,1/" class="btn btn-xs btn-mobysuccess">Contribute</a> </div>Brian Reynolds (I)</h1><ul class="nav nav-tabs" style="margin-bottom: 15px;"><li class="active"><a href="https://www.mobygames.com/developer/sheet/view/developerId,1/">Main</a></li><li><a href="https://www.mobygames.com/developer/brian-reynolds-i/credits/developerId,1/">Credits</a></li><li><a href="https://www.mobygames.com/developer/sheet/bio/developerId,1/">Biography</a></li><li><a href="https://www.mobygames.com/developer/shots/developerId,1/">Portraits</a></li></ul><h2 class="m5">Game Credits</h2>
<table class="devCreditsTable">
<tr><td colspan=5><h3 class="clean">Production</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/catan">Catan</a> (2007)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Project Lead</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
<tr><td colspan=5><h3 class="clean">Design</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-extended-edition">Rise of Nations: Extended Edition</a> (2017)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/dominations">DomiNations</a> (2015)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Creative Director</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/age-of-empires-iii-the-asian-dynasties">Age of Empires III: The Asian Dynasties</a> (2007)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Creative Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/catan">Catan</a> (2007)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Artificial Intelligence</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-rise-of-legends">Rise of Nations: Rise of Legends</a> (2006)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-gold-edition">Rise of Nations: Gold Edition</a> (2004)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-thrones-patriots">Rise of Nations: Thrones &#x26; Patriots</a> (2004)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations">Rise of Nations</a> (2003)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri-planetary-pack">Sid Meier&#x27;s Alpha Centauri: Planetary Pack</a> (2001)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Created By</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/civilization-ii-multiplayer-gold-edition">Civilization II: Multiplayer Gold Edition</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alien-crossfire">Sid Meier&#x27;s Alien Crossfire</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri">Sid Meier&#x27;s Alpha Centauri</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Lead Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-gettysburg">Sid Meier&#x27;s Gettysburg!</a> (1997)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-civilization-ii">Sid Meier&#x27;s Civilization II</a> (1996)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-colonization">Sid Meier&#x27;s Colonization</a> (1994)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design By</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/return-of-the-phantom">Return of the Phantom</a> (1993)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">MADS (MicroProse Adventure Development System) Designed by</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/super-quest">Super Quest</a> (1983)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Original Game Idea ('Quest 1')</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
<tr><td colspan=5><h3 class="clean">Programming/Engineering</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rise-of-nations-rise-of-legends">Rise of Nations: Rise of Legends</a> (2006)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Project Lead</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri-planetary-pack">Sid Meier&#x27;s Alpha Centauri: Planetary Pack</a> (2001)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/civilization-ii-multiplayer-gold-edition">Civilization II: Multiplayer Gold Edition</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alien-crossfire">Sid Meier&#x27;s Alien Crossfire</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-alpha-centauri">Sid Meier&#x27;s Alpha Centauri</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-gettysburg">Sid Meier&#x27;s Gettysburg!</a> (1997)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-civilization-ii">Sid Meier&#x27;s Civilization II</a> (1996)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/once-upon-a-forest">Once Upon a Forest</a> (1995)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">MADS Game Engine</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/dragonsphere">Dragonsphere</a> (1994)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Technical Director</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-colonization">Sid Meier&#x27;s Colonization</a> (1994)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/return-of-the-phantom">Return of the Phantom</a> (1993)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Lead Programmer</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/f-15-strike-eagle-iii">F-15 Strike Eagle III</a> (1992)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Installation Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/rex-nebular-and-the-cosmic-gender-bender">Rex Nebular and the Cosmic Gender Bender</a> (1992)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Programming</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/quest-1">Quest 1</a> (1981)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">By</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
<tr><td colspan=5><h3 class="clean">Support</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/ville">The Ville</a> (2012)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Help</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
<tr><td colspan=5><h3 class="clean">Thanks</h3></td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/kingdoms-of-amalur-reckoning">Kingdoms of Amalur: Reckoning</a> (2012)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Additional Thanks</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/c-evo">C-evo</a> (1999)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr class="devCreditsHighlight">
<td><a href="https://www.mobygames.com/game/sid-meiers-civnet">Sid Meier&#x27;s CivNet</a> (1995)</td><td class="devCreditsDivider"> &nbsp; </td><td style="white-space: nowrap;">(<span class="devCreditsTitle">Special Thanks To</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
</table>

Right now my result is:

[['Contribute Brian Reynolds (I)', 'none', 'Catan(2007)(Project Lead)', 'Catan(2007)(Project Lead)', 'none', 'none', 'none', 'Catan(2007)(Project Lead)', 'Catan(2007)(Project Lead)']]

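What I want is the contents of each specific section of the table. For example, the first section, Production, is collected correctly as "Catan(2007)(Project Lead)", but only because it happens to be the first item. Because all the other rows have the same class, I don't know how to collect the rest. So the result would look like this: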

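[['Contribute Brian Reynolds (I)', 'none', 'Catan(2007)(Project Lead)', 'Catan(2007)(Project Lead)', 'none', 'none', 'none', "Rise of Nations: Extended Edition(2017)(Design Lead)DomiNations(2015)(Creative Director)Age of Empires III: The Asian Dynasties(2007)(Creative Lead)Catan(2007)(Artificial Intelligence)Rise of Nations: Rise of Legends(2006)(Design Lead)Rise of Nations: Gold Edition(2004)(Design Lead)Rise of Nations: Thrones & Patriots(2004)(Design Lead)Rise of Nations(2003)(Game Design)Sid Meier's Alpha Centauri: Planetary Pack(2001)(Created By)Civilization II: Multiplayer Gold Edition(1999)(Game Design)Sid Meier's Alien Crossfire(1999)(Design)Sid Meier's Alpha Centauri(1999)(Lead Design)Sid Meier's Gettysburg!(1997)(Design)Sid Meier's Civilization II(1996)(Game Design)Sid Meier's Colonization(1994)(Game Design By)Return of the Phantom(1993)(MADS (MicroProse Adventure Development System) Designed by)Super Quest(1983)(Original Game Idea ('Quest 1'))", "Rise of Nations: Rise of Legends(2006)(Project Lead)Sid Meier's Alpha Centauri: Planetary Pack(2001)(Additional Programming)Civilization II: Multiplayer Gold Edition(1999)(Programming)Sid Meier's Alien Crossfire(1999)(Additional Programming)Sid Meier's Alpha Centauri(1999)(Programming)Sid Meier's Gettysburg!(1997)(Programming)Sid Meier's Civilization II(1996)(Programming)Once Upon a Forest(1995)(MADS Game Engine)Dragonsphere(1994)(Technical Director)Sid Meier's Colonization(1994)(Programming)Return of the Phantom(1993)(Lead Programmer)F-15 Strike Eagle III(1992)(Installation Programming)Rex Nebular and the Cosmic Gender Bender(1992)(Programming)Quest 1(1981)(By)", 'The Ville(2012)(Additional Help)']]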


2 Answers

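You can iterate over the rows of the table. When you find a category heading, take that element's index in the rows array; index + 1 is the tr that contains the game name. This example shows how to get each category and the first game listed under it in a loop: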

for credits in infopage:
    niceHeaderTitle = credits.find_all("h1", {"class": "niceHeaderTitle"})
    name = niceHeaderTitle[0].text

    rows = credits.find_all("tr")
    for index, row in enumerate(rows):
        cat = row.find("h3")
        # A heading row marks a category; the row right after it holds the first game
        if cat and index + 1 < len(rows):
            game = rows[index + 1].find("a")
            if game:
                print("Category: ", cat.text, " Game: ", game.text)
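The loop above stops at the first game of each category. To collect every row of one named section (start at a keyword, stop at the next heading), one possible approach is to walk the sibling rows that follow the heading. This is a minimal sketch, not a tested scraper for the live site: the helper name rows_in_section and the trimmed SAMPLE_HTML are made up for illustration, and it assumes the markup matches the excerpt in the question.

```python
from bs4 import BeautifulSoup

# Trimmed sample modelled on the HTML in the question (not the full page).
SAMPLE_HTML = """
<table class="devCreditsTable">
<tr><td colspan=5><h3 class="clean">Production</h3></td></tr>
<tr class="devCreditsHighlight"><td><a href="#">Catan</a> (2007)</td><td>(<span class="devCreditsTitle">Project Lead</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
<tr><td colspan=5><h3 class="clean">Design</h3></td></tr>
<tr class="devCreditsHighlight"><td><a href="#">DomiNations</a> (2015)</td><td>(<span class="devCreditsTitle">Creative Director</span>)</td></tr>
<tr class="devCreditsHighlight"><td><a href="#">Rise of Nations</a> (2003)</td><td>(<span class="devCreditsTitle">Game Design</span>)</td></tr>
<tr><td colspan=5>&nbsp;</td></tr>
</table>
"""


def rows_in_section(soup, keyword):
    """Return the text of every credit row under the heading `keyword`.

    Hypothetical helper: returns an empty list when the heading is absent.
    """
    heading = soup.find("h3", class_="clean", string=keyword)
    if heading is None:
        return []
    collected = []
    # Walk the rows that follow the heading row, in document order.
    for row in heading.find_parent("tr").find_next_siblings("tr"):
        if row.find("h3", class_="clean"):
            break  # the next heading starts a new section
        if "devCreditsHighlight" in row.get("class", []):
            collected.append(row.get_text(strip=True))
    return collected


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for section in ("Business", "Production", "Design"):
    print(section, rows_in_section(soup, section))
```

A missing section such as "Business" simply yields an empty list, which can be mapped to 'none' when building the final columns.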

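Production seems to come out as none because productionTitle here is actually an array (a list of tag objects rather than strings), so you probably need to loop over it and check whether any element's text is "Production":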

productionTitle = credits.find_all("h3", {"class":"clean"})

if 'Production' in productionTitle:
    productioninfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
    production = productioninfo[0].get_text(strip=True)
else:
    production = 'none'
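To make the point about the membership test concrete, the following sketch compares the check on the raw find_all result with the same check on the extracted text. The one-row HTML string is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Made-up one-heading table standing in for the real page.
html = '<table><tr><td><h3 class="clean">Production</h3></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns Tag objects, and a Tag never compares equal to a plain string.
title_tags = soup.find_all("h3", {"class": "clean"})
found_in_tags = "Production" in title_tags

# Extracting the text first makes the membership test behave as intended.
title_texts = [tag.get_text(strip=True) for tag in title_tags]
found_in_texts = "Production" in title_texts

print(found_in_tags, found_in_texts)
```

This is why converting the headings with get_text() before the `in` check, as the question's later code does, makes the category lookups succeed.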

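I can see how hard this is. The format is awkward to work with because the heading is not a parent of its section, but I found a workaround.

Edit: this now pulls all rows, not just the first.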

import bs4 as bs
import urllib.request
import numpy as np

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"

req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
infopage = soup.find_all("div", {"class": "col-md-8 col-lg-8"})
core_list = []

for credits in infopage:
    niceHeaderTitle = credits.find_all("h1", {"class": "niceHeaderTitle"})
    name = niceHeaderTitle[0].text

    # Section headings on this page, e.g. "Production", "Design"
    Titles = [title.get_text() for title in credits.find_all("h3", {"class": "clean"})]

    title = None  # no section heading seen yet
    for tr in credits.find_all("tr"):
        row = tr.get_text(strip=True)
        if row in Titles:
            # Heading row: the rows that follow belong to this section
            title = row
        elif row and title is not None:
            # Non-empty data row: record it under the current section
            core_list.append([name, title, row])

# One [name, section, row text] entry per credit row
core_list = np.array(core_list)

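To get this into columns: because each person has a different set of categories, you should probably extract the data like this, append everyone else, and then use some dataframe manipulation (a pivot) to put it into the 24 columns you're looking for.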
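As a rough illustration of that last step, the sketch below reshapes [name, section, row] records into one row per person with a fixed column per category. It uses plain dictionaries rather than a dataframe pivot, so it is a stand-in for the suggested approach, not the same technique. The records, the second developer, the four-item CATEGORIES list, and the "; " separator are all assumptions made for the example; the real list would hold all 24 category names.

```python
from collections import defaultdict

# Hypothetical long-format records in the [name, section, row text] shape.
records = [
    ("Brian Reynolds (I)", "Production", "Catan(2007)(Project Lead)"),
    ("Brian Reynolds (I)", "Design", "DomiNations(2015)(Creative Director)"),
    ("Brian Reynolds (I)", "Design", "Rise of Nations(2003)(Game Design)"),
    ("Example Developer", "Audio", "Example Game(2000)(Composer)"),  # invented entry
]

# Assumed subset of the 24 categories, in the desired column order.
CATEGORIES = ["Business", "Production", "Design", "Audio"]


def pivot(records, categories):
    """Return one row per person: [name, text for category 1, text for category 2, ...]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for name, category, text in records:
        grouped[name][category].append(text)

    table = []
    for name, by_category in grouped.items():
        row = [name]
        for category in categories:
            # Join multiple credits in one cell; use 'none' for missing sections.
            if category in by_category:
                row.append("; ".join(by_category[category]))
            else:
                row.append("none")
        table.append(row)
    return table


for row in pivot(records, CATEGORIES):
    print(row)
```

Keeping the category order in one list means every person ends up with the same columns, regardless of which sections appear on their page or in what order.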
