How can I select certain URLs with BeautifulSoup?

... <td>7</td> <td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td> <td bgcolor="" align="left">New York</td> <td bgcolor="" align="left" class="Region">N/A</td> <td bgcolor="" align="left">1,863</td> <td bgcolor="" align="left">565</td> <td bgcolor="" align="left">1,133</td> <td bgcolor="" align="left">$160,000</td> <td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF"> ...

2 Answers

User

#1 · edited 2024-07-04 08:15:18

I think this may be what you are looking for. The attrs argument helps isolate the parts you need.

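# Written for Python 2 and BeautifulSoup 3 (print statement, urllib.urlopen).
from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))

# Each data row of the table is a <tr class="small">.
rows = soup.findAll(name='tr', attrs={'class': 'small'})
for row in rows:
    number = row.find('td').text  # first cell of the row (7 in the sample)
    # The data cells are left-aligned; the "View Profile" cell is centered.
    tds = row.findAll(name='td', attrs={'align': 'left'})
    link = tds[0].find('a')['href']  # the firm's website URL
    firm = tds[0].text
    office = tds[1].text
    # tds[2] is the Region cell, which is skipped here.
    attorneys = tds[3].text
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, link, firm, office, attorneys, partners, associates, salary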
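
For anyone on Python 3, here is a rough port of the same idea to the bs4 package. It is only a sketch: it parses one row copied from the question instead of fetching the live page, the opening `<tr class="small">` tag is assumed (the question's snippet starts mid-row), and `HTML` and `parse_rows` are names made up for this example.

```python
from bs4 import BeautifulSoup

# One row copied from the question. The <table> wrapper and the opening
# <tr class="small"> tag are assumptions; the question's snippet starts mid-row.
HTML = """
<table>
<tr class="small" bgcolor="#FFFFFF">
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td>
</tr>
</table>
"""


def parse_rows(html):
    """Return one dict per <tr class="small"> row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    firms = []
    for row in soup.find_all("tr", attrs={"class": "small"}):
        # Only the left-aligned cells hold data; "View Profile" is centered.
        tds = row.find_all("td", attrs={"align": "left"})
        anchor = tds[0].find("a") if tds else None
        if len(tds) < 7 or anchor is None:
            continue  # skip rows that do not have the expected layout
        firms.append({
            "number": row.find("td").get_text(strip=True),
            "link": anchor["href"],
            "firm": tds[0].get_text(strip=True),
            "office": tds[1].get_text(strip=True),
            # tds[2] is the Region cell, skipped as in the answer above.
            "attorneys": tds[3].get_text(strip=True),
            "partners": tds[4].get_text(strip=True),
            "associates": tds[5].get_text(strip=True),
            "salary": tds[6].get_text(strip=True),
        })
    return firms


firms = parse_rows(HTML)
for firm in firms:
    print(firm["number"], firm["link"], firm["firm"], firm["office"])
```

To run it against the real page you would pass the downloaded HTML to `parse_rows` instead of the inline string; whether the live markup still matches the question's snippet is not something this sketch checks.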

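User

#2 · edited 2024-07-04 08:15:18

I would get every tr in the table with class=list. Your search is evidently too broad for the information you want. Because HTML has a structure, you can easily pull out just the table data. In the long run this is much easier than collecting every href and filtering out the ones you don't want. BeautifulSoup has plenty of documentation on how to do this: http://www.crummy.com/software/BeautifulSoup/documentation.html

Rough code (not exact):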

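for tr in soup.findAll('tr'):
  data_list = tr.findAll('td')  # the cells of this row
  if len(data_list) < 4:
    continue  # skip header or layout rows without enough cells
  data_list[0].text  # 7
  data_list[1].text  # White and Case (this cell also holds the firm's link)
  data_list[2].text  # New York
  data_list[3].text  # Region <  ignore this
  # etc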
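
To make the "use the structure instead of filtering every href" point concrete, here is a small Python 3 / bs4 sketch. It assumes the row layout from the question (firm link in a left-aligned cell, "View Profile" link in a centered cell); the opening `<tr class="small">` tag is assumed, and `ROW_HTML`, `all_hrefs`, and `firm_urls` are names invented for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question: only the two cells that
# contain links are kept. The opening <tr class="small"> is an assumption.
ROW_HTML = """
<table>
<tr class="small">
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td>
</tr>
</table>
"""

soup = BeautifulSoup(ROW_HTML, "html.parser")

# Too broad: every link in the document, including the "View Profile" ones.
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Structural: only links inside left-aligned cells of the data rows.
firm_urls = [a["href"] for a in soup.select('tr.small td[align="left"] a[href]')]

print(all_hrefs)
print(firm_urls)
```

The CSS selector is just one way to express the structure; nested `find_all` calls on the rows and cells would do the same job.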
