问题
我试图解析一个包含rowspan的HTML表,就像在分析我的大学日程安排一样。在
我遇到了这样一个问题:如果最后一行包含一个rowspan,那么下一行将缺少一个TD,其中rowspan现在是缺少的TD。在
我不知道如何解释这一点,我希望能够解析这个时间表。在
我尝试了什么
几乎所有我能想到的。在
我得到的结果 在
[
{
'blok_eind': 4,
'blok_start': 3,
'dag': 4, # Should be 5
'leraar': 'DOODF000',
'lokaal': 'ALK C212',
'vak': 'PROJ-T',
},
]
如您所见,在上面的输出片段中有一个值为PROJ-T
的vak
键,dag
是{5
(又称星期五/Vrijdag),如下所示:
我想要的结果
一个Python dict(),看起来像上面发布的,但是值正确
其中:
day
/dag
是1~5的整数,表示星期一到星期五block_start
/blok_start
是一个表示课程何时开始的int(时间块,表的左侧)block_end
/blok_eind
是一个表示课程结束的块的整数classroom
/lokaal
是课程所在的教室代码teacher
/leraar
是教师的IDcourse
/vak
是课程的ID上述数据的基本HTML结构 在
^{pr2}$代码
HTML
<CENTER><font size="3" face="Arial" color="#000000">
<BR></font>
<font size="6" face="Arial" color="#0000FF">
16AO4EIO1B
</font> <font size="4" face="Arial">
IO1B
</font>
<BR>
<TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
<TR>
<TD align="center">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial" color="#000000">
Maandag 29-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Dinsdag 30-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Woensdag 31-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Donderdag 01-09
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Vrijdag 02-09
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>1</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
8:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>2</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:10
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>3</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:25
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
DOODF000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK C212</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
PROJ-T
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>4</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
MENT
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>5</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>6</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
JONGJ003
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
BURG
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>7</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:35
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
FLUIP000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B004</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
ICT algemeen Prakti
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>8</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:50
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
KOOLE000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
NED
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>9</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>10</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
17:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
</TABLE>
<TABLE cellspacing="1" cellpadding="1">
<TR>
<TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial">
Periode1 29-08-2016 (35) - 04-09-2016 (35) G r u b e r & P e t t e r s S o f t w a r e
</font></CENTER>
Python
from pprint import pprint
from bs4 import BeautifulSoup
import requests
r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
"/c/c00025.htm")
daytable = {
1: "Maandag",
2: "Dinsdag",
3: "Woensdag",
4: "Donderdag",
5: "Vrijdag"
}
timetable = {
1: ("8:30", "9:20"),
2: ("9:20", "10:10"),
3: ("10:25", "11:15"),
4: ("11:15", "12:05"),
5: ("12:05", "12:55"),
6: ("12:55", "13:45"),
7: ("13:45", "14:35"),
8: ("14:50", "15:40"),
9: ("15:40", "16:30"),
10: ("16:30", "17:20"),
}
page = BeautifulSoup(r.content, "lxml")
roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
# There are 5 days, first column is not data we want
for day in range(2, 7):
dayroster = {
"dag": 0,
"blok_start": 0,
"blok_eind": 0,
"lokaal": "",
"leraar": "",
"vak": ""
}
# This selector provides the classroom
table_bold = page.select(
"html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
day) + ") > table > tr > td > font > b")
# This selector provides the teacher's code and the course ID
table = page.select(
"html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
day) + ") > table > tr > td > font")
# This gets the rowspan on the current row and column
rowspan = page.select(
"html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
day) + ")")
try:
if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
last_row_big = True
# Setting end of class
dayroster["blok_eind"] = (block_count // 2) + 1
else:
last_row_big = False
# Setting end of class
dayroster["blok_eind"] = (block_count // 2)
except IndexError:
pass
if table_bold:
x = table_bold[0]
# Classroom ID
dayroster["lokaal"] = x.contents[0]
if table:
iter = 0
for x in table:
content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
# Cell has data
if content != "":
# Set start of class
dayroster["blok_start"] = block_count // 2
# Set day of class
dayroster["dag"] = day - 1
if iter == 0:
# Teacher ID
dayroster["leraar"] = content
elif iter == 1:
# Course ID
dayroster["vak"] = content
iter += 1
if table or table_bold:
# Store the data
roster.append(dayroster)
# Remove duplicates
seen = set()
new_l = []
for d in roster:
t = tuple(d.items())
if t not in seen:
seen.add(t)
new_l.append(d)
pprint(new_l)
也许最好使用bs4内置函数来解析表,比如“findAll”。在
您可以使用以下代码:
您必须跟踪前几行的行跨距,每列一行。在
您可以简单地将rowspan的整数值复制到字典中,随后的行将rowspan值递减,直到它降到
1
(或者,为了便于编码,我们可以将整数值减1并放到0
)。然后,可以根据前面的行跨度调整后续表计数。在您的表使用了一个默认的范围大小2,每两步递增,但可以很容易地将其除以2,从而使这一点变得更加复杂。在
不要使用大量的CSS选择器,只选择表行,我们将迭代这些行:
这将产生正确的输出:
^{pr2}$此外,即使课程的范围超过2个块,或只有一个块,此代码仍将继续工作;支持任何行范围大小。在
相关问题 更多 >
编程相关推荐