So I am trying to clean up this table, https://en.wikipedia.org/wiki/Korean_drama#List_of_highest-rated_Korean_dramas_in_cable_television, and the Network column is giving me trouble.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://en.wikipedia.org/wiki/Korean_drama")
bsObj = BeautifulSoup(html, features="lxml")
kdramas = bsObj.find("span", {
"id": "List_of_highest-rated_Korean_dramas_in_cable_television"})
list_kdramas = kdramas.parent.next_sibling.next_sibling.next_sibling.next_sibling
table = list_kdramas.find_all('tr')
final = []
for i in range(1, len(table)):
    temp = []  # temporary array for storing the subvalues of each row
    row = table[i].find_all('td')
    for k in range(len(row)-1):
        try:
            temp.append(row[k].get_text())
        except AttributeError:
            temp.append(row[k].find('a').get_text())
    final.append(temp)
for i in final:
    if len(i) == 5:
        print("Rank:{}, Show: {}, Channel: {}, Rating: {}, Date:{} ".format(
            i[0], i[1], i[2], i[3], i[4]))
    else:
        print("Rank:{}, Show: {}, Rating: {}, Date: {}".format(
            i[0], i[1], i[2], i[3]))
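For some of the shows in my output, the Network column does not appear, which is why I have to check the length of each i in the final array so that the formatting does not break.
This is the output (only the first 5 shown); you can see that some of them have no channel: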
Rank:1 Show: The World of the Married Channel: JTBC, Rating: 28.371% Date:16 May 2020
Rank:2 Show: SKY Castle Rating: 23.779% Date: 1 February 2019
Rank:3 Show: Crash Landing on You Channel: tvN, Rating: 21.683% Date:16 February 2020
Rank:4 Show: Reply 1988 Rating: 18.803% Date: 16 January 2016
Rank:5 Show: Guardian: The Lonely and Great God Rating: 18.680% Date: 21 January 2017
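This is because of the structure of the table:
In the "Network" column, some cells extend over several rows because of the "rowspan" attribute of the "td" element. This attribute defines how many rows the td element should cover. In the subsequent rows, the corresponding td element is missing (which is why the channel is also missing from your result).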
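The effect described above can be reproduced on a tiny table. The following is a minimal sketch, assuming a made-up two-row sample (the show names and ratings are placeholders, not the real Wikipedia data) and the built-in "html.parser" backend instead of lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout of the Wikipedia table:
# the Network cell of the first row covers two rows.
sample = """<table>
<tr><td>1</td><td>Show A</td><td rowspan="2">JTBC</td><td>28.4%</td></tr>
<tr><td>2</td><td>Show B</td><td>23.8%</td></tr>
</table>"""

soup = BeautifulSoup(sample, "html.parser")

# Number of td elements found in each row
cell_counts = [len(tr.find_all("td")) for tr in soup.find_all("tr")]
print(cell_counts)  # [4, 3]

# The rowspan value is an ordinary attribute of the td element
rowspans = [td.get("rowspan") for td in soup.find_all("td")
            if td.has_attr("rowspan")]
print(rowspans)  # ['2']
```

The second row has one td fewer than the first, which matches the shorter lists in the question.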
To get the rowspan value, you can use the following code:
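The original script is not reproduced here, so what follows is only a sketch of one possible approach. The function name expand_rowspans and the sample table are hypothetical, and the code assumes that data cells are td elements, that header rows contain only th elements, and that no cell uses colspan:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the table layout (placeholder data).
SAMPLE = """<table>
<tr><th>Rank</th><th>Show</th><th>Network</th><th>Rating</th></tr>
<tr><td>1</td><td>Show A</td><td rowspan="2">JTBC</td><td>28.4%</td></tr>
<tr><td>2</td><td>Show B</td><td>23.8%</td></tr>
<tr><td>3</td><td>Show C</td><td>tvN</td><td>21.7%</td></tr>
</table>"""


def expand_rowspans(table):
    """Return the data rows as lists of text, repeating rowspan cells."""
    rows = []
    pending = {}  # column index -> (rows still to fill, cell text)
    for tr in table.find_all("tr"):
        cells = iter(tr.find_all("td"))
        row = []
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                remaining, text = pending[col]
                row.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (remaining - 1, text)
            else:
                td = next(cells, None)
                if td is None:
                    break  # no more cells in this row
                text = td.get_text(strip=True)
                span = int(td.get("rowspan", 1))
                row.append(text)
                if span > 1:
                    # Remember the cell for the following span - 1 rows.
                    pending[col] = (span - 1, text)
            col += 1
        if row:  # skip header rows, which have no td elements
            rows.append(row)
    return rows


table = BeautifulSoup(SAMPLE, "html.parser").find("table")
rows = expand_rowspans(table)
for row in rows:
    print(row)
```

With the page from the question, the table found by the existing code (list_kdramas) could be passed to the function in the same way, after which every row should have the same number of entries and the length check is no longer needed.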
This script expands each <td rowspan=".."> cell across the rows it covers, so you get the correct information. It prints: