Extracting an HTML table with Python BeautifulSoup

<table> <thead> <tr> <th>Period Ending:</th> <th class="TalignL">Trend</th> <th>9/27/2014</th> <th>9/28/2013</th> <th>9/29/2012</th> <th>9/24/2011</th> </tr> </thead> <tr> <th bgcolor="#E6E6E6">Total Revenue</th> <td class="td_genTable"><table border="0" align="center" width="*" cellspacing="0" cellpadding="0"><tr><td align="bottom"><table border="0" height="100%" cellspacing="0" cellpadding="0"><tr><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="15" bgcolor="#47C3D3" width="6"></td><td height="15" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="1" bgcolor="#FFFFFF" width="6"></td><td height="1" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="14" bgcolor="#47C3D3" width="6"></td><td height="14" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="2" bgcolor="#FFFFFF" width="6"></td><td height="2" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="13" bgcolor="#47C3D3" width="6"></td><td height="13" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="7" bgcolor="#FFFFFF" width="6"></td><td height="7" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="8" bgcolor="#47C3D3" width="6"></td><td height="8" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="1" bgcolor="#D1D1D1"></td></tr></table></td></tr></table></td></tr></table></td> <td>$182,795,000</td> <td>$170,910,000</td> <td>$156,508,000</td> <td>$108,249,000</td>

2 Answers

User

Answer 1 · edited 2024-09-27 09:27:26

Adding to what @abarner pointed out: I would get all the td elements whose text starts with $:

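for row in soup.table.find_all('tr', recursive=False):
    # Keep only cells whose text starts with "$" (on bs4 older than 4.4.0 the keyword is text=, not string=).
    record = [td.text.replace(",", "") for td in row.find_all("td", string=lambda x: x and x.startswith("$"))]
    print(record)

For the input you provided, it prints:

['$182795000', '$170910000', '$156508000', '$108249000']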

You can "unpack" it into separate variables:

account, period1, period2, period3 = record

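Note that I explicitly pass recursive=False to avoid going deeper into the tree and to get only the tr elements that are direct children of the table element.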
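To make the effect of recursive=False concrete, here is a minimal sketch. The markup is a hypothetical, shortened stand-in for the table in the question, and it assumes the built-in "html.parser" backend (other parsers may restructure the tree, for example by inserting tbody):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the question's markup:
# one outer row whose first cell holds a nested layout table.
html = """
<table>
  <tr>
    <th>Total Revenue</th>
    <td><table><tr><td></td><td></td></tr></table></td>
    <td>$182,795,000</td>
    <td>$170,910,000</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Default search descends into the nested table as well.
all_rows = soup.table.find_all("tr")
# recursive=False only looks at direct children of the outer table.
direct_rows = soup.table.find_all("tr", recursive=False)

print(len(all_rows))     # 2: the outer row plus the nested table's row
print(len(direct_rows))  # 1: only the outer row
```

If the real page wraps its rows in tbody, the direct-children search would have to start from that element instead.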

User

Answer 2 · edited 2024-09-27 09:27:26

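Your first problem is that find_all (or findAll, which is just a deprecated synonym for the same thing) doesn't just find the rows in your table; it finds the rows in your table plus the rows in every subtable inside it. You almost certainly don't want to iterate over both kinds of rows and run the same code on each. If you don't want that, then, as the documentation for the recursive argument says, pass recursive=False.

So, now you get back only one row. If you called row.find_all('td'), you'd have the same problem again: you'd find all the columns of this row, plus all the columns of every row of every subtable inside one of those columns. Again, that's not what you want, so use recursive=False.

Now you get back just 5 columns. The first is a big table with a bunch of empty cells; the other four hold dollar values, which seem to be what you want.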
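The same point at the cell level can be sketched like this. The row is a hypothetical, shortened version of the one in the question (two dollar cells instead of four), parsed with "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical shortened row: a header cell, a cell holding a nested
# layout table, and two dollar cells.
html = (
    "<table><tr>"
    "<th>Total Revenue</th>"
    "<td><table><tr><td></td><td></td></tr></table></td>"
    "<td>$182,795,000</td><td>$170,910,000</td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
row = soup.table.find("tr", recursive=False)

# Default search also returns the empty cells of the nested table.
nested = [td.get_text(strip=True) for td in row.find_all("td")]
# recursive=False returns only the cells that belong to this row.
direct = [td.get_text(strip=True) for td in row.find_all("td", recursive=False)]

print(nested)  # ['', '', '', '$182,795,000', '$170,910,000']
print(direct)  # ['', '$182,795,000', '$170,910,000']
```

The leading empty string in the second list is the chart cell itself, which is why the dollar values start at index 1.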

So, just add recursive=False to both calls, and set stock to something (I don't know where it's supposed to come from in your code, but without it you'd obviously just get a NameError):

stock = 'spam'

# table is the outer table element from your code
rows = table.find_all('tr', recursive=False)

for row in rows:
    cols = row.find_all('td', recursive=False)
    col1 = [ele.text.strip().replace(',','') for ele in cols]

    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]

    record = (stock, account,period1,period3,period3)

    print(record)

This prints:

('spam', [''], ['$170910000'], ['$108249000'], ['$108249000'])
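I don't know why you use period3 twice and never use period2, why you skip column 1 entirely, or why you slice out 1-element lists instead of just indexing the values, but either way, this seems to be what you were trying to do.
As a side note, if you really want to break the list into 5 values, rather than into four 1-element lists while skipping one of the values, you can write: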
account, whatever, period1, period2, period3 = col1
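To illustrate the difference between slicing and indexing mentioned above, here is a small sketch. The list is what the loop would collect for the row in the question; the names chart, period4 and to_number are hypothetical and not part of the original code:

```python
# Cell texts as collected by the loop for the question's row;
# the first entry is the empty chart cell.
col1 = ['', '$182795000', '$170910000', '$156508000', '$108249000']

sliced = col1[2:3]   # a one-element list: ['$170910000']
indexed = col1[2]    # the string itself: '$170910000'

# Unpacking needs exactly one name per element; `chart` is a
# hypothetical name for the cell we do not care about.
chart, period1, period2, period3, period4 = col1


def to_number(amount):
    """Hypothetical helper: turn '$182795000' into the int 182795000."""
    return int(amount.lstrip('$'))


print(sliced, indexed, to_number(period1))
```

Unpacking raises a ValueError if the number of names does not match the number of elements, which can be a useful sanity check when the page layout changes.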
