Why does BeautifulSoup's find_all sometimes find all the elements and sometimes not?

Posted 2024-06-01 07:03:28


I'm trying to scrape this table: https://en.wikipedia.org/wiki/Korean_drama#List_of_highest-rated_Korean_dramas_in_cable_television and the Network column is giving me trouble.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/Korean_drama")
bsObj = BeautifulSoup(html, features="lxml")
kdramas = bsObj.find("span", {
    "id": "List_of_highest-rated_Korean_dramas_in_cable_television"})
list_kdramas = kdramas.parent.next_sibling.next_sibling.next_sibling.next_sibling
table = list_kdramas.find_all('tr')
final = []

for i in range(1, len(table)):
    temp = []  # temporary array for storing the subvalues of each row
    row = table[i].find_all('td')
    for k in range(len(row)-1):
        try:
            temp.append(row[k].get_text())
        except AttributeError:
            temp.append(row[k].find('a').get_text())

    final.append(temp)
for i in final:
    if len(i) == 5:
        print("Rank:{}, Show: {}, Channel: {}, Rating: {}, Date:{} ".format(
            i[0], i[1], i[2], i[3], i[4]))
    else:
        print("Rank:{}, Show: {}, Rating: {}, Date: {}".format(
            i[0], i[1], i[2], i[3]))

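In my output, the Network column is missing for some of the shows, which is why I have to check the length of each i in the final list to keep the formatting from breaking.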

Here is the output (only the first 5 rows shown). You can see that some of them have no channel:

Rank:1 Show: The World of the Married Channel: JTBC, Rating: 28.371% Date:16 May 2020
 
Rank:2 Show: SKY Castle Rating: 23.779% Date: 1 February 2019

Rank:3 Show: Crash Landing on You Channel: tvN, Rating: 21.683% Date:16 February 2020
 
Rank:4 Show: Reply 1988 Rating: 18.803% Date: 16 January 2016

Rank:5 Show: Guardian: The Lonely and Great God Rating: 18.680% Date: 21 January 2017


2 Answers

This is because of the structure of the table:

tr, td {
  border: 1px solid darkgrey;
}
<table>
  <tr>
    <td>column 1, row 1</td>
    <td rowspan="2">column 2, row 1</td>
  </tr>
  <tr>
    <td>column 1, row 2</td>
  </tr>
  <tr>
    <td>column 1, row 3</td>
    <td>column 2, row 3</td>
  </tr>
  <tr>
    <td>column 1, row 4</td>
    <td>column 2, row 4</td>
  </tr>
</table>

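In the "Network" column, some cells span several rows because of the rowspan attribute on the td element. This attribute defines how many rows the td element should cover. In the rows that follow, the corresponding td element is absent from the markup, which is why the channel is also missing from your results. find_all is returning every td that exists; the rows simply contain fewer of them.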
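The effect can be reproduced offline. The following is a minimal sketch that parses the sample table above (inlined as the `SAMPLE_HTML` string, a name chosen here for illustration) and counts the td elements in each row. It assumes only that `bs4` is installed and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Same layout as the sample table above, without the styling.
SAMPLE_HTML = """
<table>
  <tr><td>column 1, row 1</td><td rowspan="2">column 2, row 1</td></tr>
  <tr><td>column 1, row 2</td></tr>
  <tr><td>column 1, row 3</td><td>column 2, row 3</td></tr>
  <tr><td>column 1, row 4</td><td>column 2, row 4</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Count the <td> elements that actually exist in each <tr>.
cells_per_row = [len(tr.find_all("td")) for tr in soup.find_all("tr")]
print(cells_per_row)  # [2, 1, 2, 2]
```

The second row has only one td because the cell with rowspan="2" in the first row visually covers it, so there is nothing more for find_all to return in that row.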

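To get the rowspan value, you can use the following code. Cells without the attribute fall back to 1, since int(None) would raise a TypeError:

rowspan = int(row[k].get('rowspan', 1))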
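One way to use that value is to remember each spanning cell and repeat its text in the rows it covers. The sketch below is an illustration under stated assumptions, not the only way to do it: `table_to_rows` and `SAMPLE_HTML` are hypothetical names, colspan is ignored, and the table is assumed to contain no nested tables.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<table>
  <tr><td>column 1, row 1</td><td rowspan="2">column 2, row 1</td></tr>
  <tr><td>column 1, row 2</td></tr>
  <tr><td>column 1, row 3</td><td>column 2, row 3</td></tr>
</table>
"""


def table_to_rows(table):
    """Return the table as lists of cell text, repeating rowspan cells downward.

    Hypothetical helper: ignores colspan and assumes no nested tables.
    """
    pending = {}  # column index -> (text, number of rows still to fill)
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"], recursive=False)
        row, col, i = [], 0, 0
        while i < len(cells) or col in pending:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, remaining = pending[col]
                row.append(text)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                else:
                    del pending[col]
            else:
                # Take the next cell that is really present in this row.
                cell = cells[i]
                i += 1
                text = cell.get_text(strip=True)
                row.append(text)
                span = int(cell.get("rowspan", 1))
                if span > 1:
                    pending[col] = (text, span - 1)
            col += 1
        rows.append(row)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
for row in rows:
    print(row)
```

Every row in `rows` then has the same number of entries, so a single format string can be used for printing and the length check in the question is no longer needed.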

The following script expands each <td rowspan=".."> across the rows it spans, so you get the correct information:

import requests
from bs4 import BeautifulSoup


url = 'https://en.wikipedia.org/wiki/Korean_drama'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('#List_of_highest-rated_Korean_dramas_in_cable_television').find_next('table')


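def expand_rowspans(table):
    # Repeat until no <td> in the table carries a rowspan attribute.
    while table.select_one('td[rowspan]'):
        td = table.select_one('td[rowspan]')
        # 0-based position of this cell among the <td> children of its row
        n = td.find_parent('tr').find_all('td', recursive=False).index(td)
        # Remove the attribute so the copies made below do not carry it
        rs = int(td.attrs.pop('rowspan'))
        # Copy the cell into each of the next rs-1 rows, right after the n-th
        # cell (nth-child is 1-based; assumes n >= 1, i.e. not the first column)
        for tr in td.find_parent('tr').find_next_siblings('tr')[:rs-1]:
            tr.select_one('td:nth-child({})'.format(n)).insert_after(BeautifulSoup(str(td), 'html.parser'))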


expand_rowspans(table)

for row in table.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    print("Rank:{:<3} Show: {:<40} Channel: {:<10} Rating: {:<10} Date: {:<10}".format(*tds))

Prints:

Rank:1   Show: The World of the Married                 Channel: JTBC       Rating: 28.371%    Date: 16 May 2020
Rank:2   Show: SKY Castle                               Channel: JTBC       Rating: 23.779%    Date: 1 February 2019
Rank:3   Show: Crash Landing on You                     Channel: tvN        Rating: 21.683%    Date: 16 February 2020
Rank:4   Show: Reply 1988                               Channel: tvN        Rating: 18.803%    Date: 16 January 2016
Rank:5   Show: Guardian: The Lonely and Great God       Channel: tvN        Rating: 18.680%    Date: 21 January 2017
Rank:6   Show: Mr. Sunshine                             Channel: tvN        Rating: 18.129%    Date: 30 September 2018
Rank:7   Show: Itaewon Class                            Channel: JTBC       Rating: 16.548%    Date: 21 March 2020
Rank:8   Show: 100 Days My Prince                       Channel: tvN        Rating: 14.412%    Date: 30 October 2018
Rank:9   Show: Hospital Playlist                        Channel: tvN        Rating: 14.142%    Date: 28 May 2020
Rank:10  Show: Signal                                   Channel: tvN        Rating: 12.544%    Date: 12 March 2016
Rank:11  Show: The Lady in Dignity                      Channel: JTBC       Rating: 12.065%    Date: 19 August 2017
Rank:12  Show: Hotel del Luna                           Channel: tvN        Rating: 12.001%    Date: 1 September 2019
Rank:13  Show: Reply 1994                               Channel: tvN        Rating: 11.509%    Date: 28 December 2013
Rank:14  Show: Prison Playbook                          Channel: tvN        Rating: 11.195%    Date: 18 January 2018
Rank:15  Show: The Crowned Clown                        Channel: tvN        Rating: 10.851%    Date: 4 March 2019
Rank:16  Show: My Kids Give Me a Headache               Channel: JTBC       Rating: 10.715%    Date: 17 March 2013
Rank:17  Show: Encounter                                Channel: tvN        Rating: 10.329%    Date: 24 January 2019
Rank:18  Show: Memories of the Alhambra                 Channel: tvN        Rating: 10.025%    Date: 20 January 2019
Rank:19  Show: Another Miss Oh                          Channel: tvN        Rating: 9.991%     Date: 28 June 2016
Rank:20  Show: The Light in Your Eyes                   Channel: JTBC       Rating: 9.731%     Date: 19 March 2019
Rank:21  Show: Strong Girl Bong-soon                    Channel: JTBC       Rating: 9.668%     Date: 15 April 2017
Rank:22  Show: Lawless Lawyer                           Channel: tvN        Rating: 8.937%     Date: 1 July 2018
Rank:23  Show: What's Wrong with Secretary Kim          Channel: tvN        Rating: 8.665%     Date: 26 July 2018
Rank:24  Show: Graceful Family                          Channel: MBN        Rating: 8.478%     Date: 17 October 2019
Rank:25  Show: Misty                                    Channel: JTBC       Rating: 8.452%     Date: 24 March 2018
Rank:26  Show: Misaeng: Incomplete Life                 Channel: tvN        Rating: 8.240%     Date: 20 December 2014
Rank:27  Show: Familiar Wife                            Channel: tvN        Rating: 8.210%     Date: 20 September 2018
Rank:28  Show: Dear My Friends                          Channel: tvN        Rating: 8.087%     Date: 2 July 2016
Rank:29  Show: Live                                     Channel: tvN        Rating: 7.730%     Date: 6 May 2018
Rank:30  Show: Arthdal Chronicles                       Channel: tvN        Rating: 7.705%     Date: 22 September 2019
Rank:31  Show: Stranger 2                               Channel: tvN        Rating: 7.627%     Date: (currently airing)
Rank:32  Show: The Good Detective                       Channel: JTBC       Rating: 7.609%     Date: 25 August 2020
Rank:33  Show: My Mister                                Channel: tvN        Rating: 7.352%     Date: 17 May 2018
Rank:34  Show: It's Okay to Not Be Okay                 Channel: tvN        Rating: 7.348%     Date: 9 August 2020
Rank:35  Show: Oh My Ghost                              Channel: tvN        Rating: 7.337%     Date: 22 August 2015
Rank:36  Show: Something in the Rain                    Channel: JTBC       Rating: 7.281%     Date: 19 May 2018
Rank:37  Show: Second 20s                               Channel: tvN        Rating: 7.233%     Date: 17 October 2015
Rank:38  Show: Cheese in the Trap                       Channel: tvN        Rating: 7.102%     Date: 1 March 2016
Rank:39  Show: Voice 2                                  Channel: OCN        Rating: 7.086%     Date: 16 September 2018
Rank:40  Show: A Korean Odyssey                         Channel: tvN        Rating: 6.942%     Date: 4 March 2018
Rank:41  Show: Live Up to Your Name                     Channel: tvN        Rating: 6.907%     Date: 1 October 2017
Rank:42  Show: The Cursed                               Channel: tvN        Rating: 6.721%     Date: 17 March 2020
Rank:43  Show: Romance Is a Bonus Book                  Channel: tvN        Rating: 6.651%     Date: 17 March 2019
Rank:44  Show: The K2                                   Channel: tvN        Rating: 6.636%     Date: 12 November 2016
Rank:45  Show: Watcher                                  Channel: OCN        Rating: 6.585%     Date: 25 August 2019
Rank:46  Show: Stranger                                 Channel: tvN        Rating: 6.568%     Date: 30 July 2017
Rank:47  Show: Hi Bye, Mama!                            Channel: tvN        Rating: 6.519%     Date: 19 April 2020
Rank:48  Show: Tunnel                                   Channel: OCN        Rating: 6.490%     Date: 21 May 2017
Rank:49  Show: Queen: Love and War                      Channel: TV Chosun  Rating: 6.348%     Date: 9 February 2020
Rank:50  Show: Avengers Social Club                     Channel: tvN        Rating: 6.330%     Date: 16 November 2017
