Python web scraping: getting the contents of only the first div with a given class

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
# I tried putting a [0] so it will only return divs in the first genTable thin floatL class
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]:
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
# Take only the first matching container, then search inside its table
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    # div.string is None when the div has nested tags, so guard before matching
    if s and re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))

1 answer

User

#1 · Posted on 2024-05-05 00:00:39

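You mentioned that two elements match the 'class':'genTable thin floatL' criterion. Indexing with [0] already gives you a single element, so running a for loop over that first element makes no sense.

So use: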

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
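To illustrate why looping over the indexed result does not do what the question intended, here is a minimal sketch. The markup in SAMPLE_HTML is a hypothetical stand-in for the real page (whose actual structure is not shown in this thread), and "html.parser" is chosen here only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: two containers with the same class
SAMPLE_HTML = (
    '<div class="genTable thin floatL"><table><tr>'
    '<td><div class="ipo-cell-height">1/2/2015</div></td>'
    '<td><div class="ipo-cell-height">First Co</div></td>'
    '</tr></table></div>'
    '<div class="genTable thin floatL"><table><tr>'
    '<td><div class="ipo-cell-height">3/4/2015</div></td>'
    '<td><div class="ipo-cell-height">Second Co</div></td>'
    '</tr></table></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like collection of every matching element
matches = soup.find_all('div', attrs={'class': 'genTable thin floatL'})

# Indexing picks one element (a Tag), not a list of elements
divparent = matches[0]

# Iterating over a Tag walks its direct children, not "all matching divs"
child_names = [child.name for child in divparent if child.name]

print(len(matches))
print(divparent.name)
print(child_names)
```

In this sketch the loop variable would only ever be the table child of the first container, which is why the for loop in the question adds nothing.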

Now you should no longer call soup.find_all, because that searches the entire document. You need to restrict the search to divparent. So you want:

table = divparent.find('table')

The rest of the code that extracts the dates and company names stays the same, except that it now refers to the table variable.

for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
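Putting the pieces together, the scoped search might look like the sketch below. It runs against a hypothetical SAMPLE_HTML string rather than the live page, and first_table_pairs is an illustrative helper name, not something from the original post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
SAMPLE_HTML = (
    '<div class="genTable thin floatL"><table><tr>'
    '<td><div class="ipo-cell-height">1/2/2015</div></td>'
    '<td><div class="ipo-cell-height">First Co</div></td>'
    '</tr></table></div>'
    '<div class="genTable thin floatL"><table><tr>'
    '<td><div class="ipo-cell-height">3/4/2015</div></td>'
    '<td><div class="ipo-cell-height">Second Co</div></td>'
    '</tr></table></div>'
)

DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')


def first_table_pairs(html):
    """Return (date, name) pairs found in the first matching container only."""
    soup = BeautifulSoup(html, "html.parser")
    divparent = soup.find_all('div', attrs={'class': 'genTable thin floatL'})[0]
    table = divparent.find('table')
    pairs = []
    # Search inside table, not soup, so the second container is never visited
    for div in table.find_all('div', attrs={'class': 'ipo-cell-height'}):
        s = div.string
        if s and DATE_RE.match(s):
            div_next = div.find_next('div')
            pairs.append((str(s), str(div_next.string)))
    return pairs


pairs = first_table_pairs(SAMPLE_HTML)
for date, name in pairs:
    print('{} - {}'.format(date, name))
```

Calling find_all on table instead of soup is the only change that matters here: the same query on soup would also pick up the cells of the second container.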

Hope that helps.
