<p>Just grab all the states from the initial page, then you can select the second table and use the <em>css classes</em> <em>odd results</em> to get the <em>tr</em> you need; since the class names are unique, there is no need to slice:</p>
<pre><code>import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin

base = "http://livingwage.mit.edu"
res = requests.get(base)
res.raise_for_status()

states = []
# Get each state url and state name from the anchors on the base page:
# select every anchor inside the li tags that are children of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split once on / from the
    # right -> "/states/51" and join that to the base url. The anchor text
    # holds the state name, so we store (full url, state), e.g.
    # ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))

def parse(soup):
    # Get the second table; css indexing starts at 1, so
    # "table:nth-of-type(2)" selects the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # The row we want has the css classes "odd results". "td + td" starts
    # from the second td, skipping the first one, which holds the label
    # "Required annual income before taxes"; we call .text on each of the rest.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]

# Unpack the url and state from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))
</code></pre>
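<p>To illustrate the url handling above in isolation, here is a minimal, self-contained sketch of how <code>rsplit</code> and <code>urljoin</code> turn an href like <code>/states/51/locations</code> into a state url (the href value is the example from the code comments, used here as a literal so no network access is needed):</p>
<pre><code>from urllib.parse import urljoin

base = "http://livingwage.mit.edu"
href = "/states/51/locations"
# Drop the trailing "/locations" segment with a single split from the right,
# then resolve the remaining path against the base url.
state_path = href.rsplit("/", 1)[0]   # "/states/51"
print(urljoin(base, state_path))      # http://livingwage.mit.edu/states/51
</code></pre>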
<p>Running the code prints each state name followed by the list of values scraped from its table row.</p>
<p>You could also loop over the range 1-53 to build the urls, but pulling the anchors from the base page gives us the state name in the same step. Using the h1 from each state page would instead give you something like "Living Wage Calculation for Alabama", which you would then have to parse to recover the name; that is not trivial considering some states have names of more than one word.</p>
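<p>For completeness, a sketch of that h1 parsing, assuming the heading follows the fixed pattern "Living Wage Calculation for &lt;state&gt;" described above. The heading string below is a hypothetical literal, not fetched from the site:</p>
<pre><code># Hypothetical h1 text; on a real page it would come from something like
# soup.select_one("h1").text.
h1_text = "Living Wage Calculation for New Hampshire"
prefix = "Living Wage Calculation for "
# Strip the fixed prefix; this handles multi-word state names too.
state = h1_text[len(prefix):] if h1_text.startswith(prefix) else h1_text
print(state)  # New Hampshire
</code></pre>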