Collecting information from HTML files with Python

<TR> <TD VALIGN="top"> /s/ ROBERT F. MANGANO<HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000" ALIGN="left"></TD> <TD VALIGN="bottom">  </TD> <TD VALIGN="top" ROWSPAN="2"> President, Chief Executive Officer and Director (Principal Executive Officer)</TD> <TD VALIGN="bottom"> </TD> <TD VALIGN="top" ROWSPAN="2" ALIGN="center">March 24, 2005</TD></TR>

from lxml import html

# titleKeywords is assumed to be defined elsewhere as a list of lowercase keywords.
def htmlParser(self):
    # Parse the page and collect the text nodes found under each <td>.
    pageTree = html.fromstring(self.pageContent)
    print("page parsed!")
    tdTexts = pageTree.xpath("//td/descendant::*/text()")
    cleanTexts = [eachText.strip() for eachText in tdTexts if eachText.strip()]
    for i in range(1, len(cleanTexts)):
        if '/s/' in cleanTexts[i] and (i + 1) < len(cleanTexts):
            # Signature line found: check whether the next text looks like a title.
            title = [cleanTexts[i + 1] for eachKeyword in titleKeywords
                     if eachKeyword in cleanTexts[i + 1].lower()]
            if title:
                print(title)
                self.boards.append([self.pageURL, cleanTexts[i].replace('/s/', ''), cleanTexts[i + 1]])
                print(self.boards)
            elif (i + 2) < len(cleanTexts):
                # Otherwise try the text one position further on.
                title = [cleanTexts[i + 2] for eachKeyword in titleKeywords
                         if eachKeyword in cleanTexts[i + 2].lower()]
                if title:
                    self.boards.append([self.pageURL, cleanTexts[i].replace('/s/', ''), cleanTexts[i + 2]])

</TR> <TR VALIGN="TOP"> <TD WIDTH="40%" ALIGN="CENTER" VALIGN="CENTER">/s/  JONATHAN C. COON      <HR NOSHADE> Jonathan C. Coon</TD> <TD WIDTH="3%" VALIGN="CENTER"> </TD> <TD WIDTH="58%" VALIGN="CENTER">Chief Executive Officer and Director (principal executive officer)</TD> </TR>

1 answer

User

#1 · Posted 2024-09-30 14:27:57

Use BeautifulSoup to parse the HTML:

from bs4 import BeautifulSoup

html = """
<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman"           SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000"  ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New   Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24,  2005</FONT></TD></TR>
"""

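# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")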

print("\n".join([x.text.strip() for x in soup.find_all("td")]))

Output:

/s/ ROBERT F. MANGANO

President, Chief Executive Officer and Director (Principal Executive Officer)

March 24,  2005
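
The output above lists the cell text but does not yet pair each signer with a title. One way to do that is sketched below, under some assumptions: it uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies, the names `RowCollector` and `signers` are made up for illustration, and it assumes the signature cell comes first in its row with the title in the next non-empty cell, as in the sample row (inline styling trimmed here for brevity).

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TR>/<TD> arrive as tr/td.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace, including non-breaking spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def signers(html_text):
    """Return (name, title) pairs for rows whose first non-empty cell starts with '/s/'."""
    parser = RowCollector()
    parser.feed(html_text)
    pairs = []
    for row in parser.rows:
        cells = [c for c in row if c]  # drop empty spacer cells
        if len(cells) >= 2 and cells[0].startswith("/s/"):
            pairs.append((cells[0][len("/s/"):].strip(), cells[1]))
    return pairs


sample = """
<TR>
<TD VALIGN="top"><P><FONT SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR NOSHADE></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"><P><FONT SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P><FONT SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT SIZE="2">March 24, 2005</FONT></TD></TR>
"""

print(signers(sample))
```

Grouping by row avoids guessing whether the title is one or two text nodes after the signature. Rows where a typed name follows the `<HR>` inside the signature cell would return that typed name appended to the signer's name, so they would need extra handling.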
