Python > BeautifulSoup > Web scraping: loop over URLs (1 to 53) and save the results

http://livingwage.mit.edu/states/01 http://livingwage.mit.edu/states/02 http://livingwage.mit.edu/states/04 (For some reason they skipped 03) ...all the way to... http://livingwage.mit.edu/states/56

2 answers

User

Answer 1 · edited 2024-10-04 05:23:52

Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?

You can get the text simply by reading the .text attribute of the element, as in this line:

state_name=states.find('h1').text

The same approach works for each row.
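As a small illustration of the difference between a tag and its text, the sketch below parses a made-up HTML snippet and reads the h1 and the cells of one row. The markup and the dollar values are assumptions for the example, not the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a state page; the real page's HTML may differ.
html = """
<h1>Living Wage Calculation for Alabama</h1>
<table>
  <tr><td>Living Wage</td><td>$10.00</td><td>$20.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

heading_tag = soup.find("h1")    # a Tag object; printing it shows the HTML
state_name = heading_tag.text    # a plain string without the surrounding tags

# The same idea applied to every cell of a row.
row = soup.find("tr")
cells = [td.get_text(strip=True) for td in row.find_all("td")]
```

Here get_text(strip=True) is used for the cells so that surrounding whitespace is dropped; calling .text.strip() on each cell would do the same job.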

Problem 2: How do I loop through the requests.get(url01 to url56)?

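The same block of code can be placed inside a loop from 1 to 56, building each URL from the zero-padded state number, for example str(i).zfill(2).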

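zfill adds those leading zeros. It is also worth wrapping requests.get in a try-except block so that the loop carries on normally even when a URL is bad. Note that a missing page comes back as an HTTP error status rather than an exception, unless you call raise_for_status() on the response.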
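The loop described above might look roughly like the sketch below. The helper names state_urls and fetch_all are hypothetical, and the HTTP call is passed in as a parameter so the sketch does not depend on any particular library; with requests you would pass requests.get and narrow the except clause to requests.RequestException.

```python
def state_urls(first=1, last=56, base="http://livingwage.mit.edu/states/"):
    """Yield one URL per state number, zero-padded to two digits."""
    for n in range(first, last + 1):
        yield base + str(n).zfill(2)   # 1 -> "01", 56 -> "56"


def fetch_all(urls, get):
    """Fetch each URL with `get` (e.g. requests.get), skipping the ones that fail.

    Returns a dict mapping each URL that succeeded to its response text.
    """
    pages = {}
    for url in urls:
        try:
            res = get(url)
            res.raise_for_status()     # turn 404 and similar into an exception
        except Exception as exc:       # broad on purpose; narrow it for a real client
            print("skipping", url, "->", exc)
            continue
        pages[url] = res.text
    return pages
```

With requests installed, pages = fetch_all(state_urls(), requests.get) would request states 01 through 56 and simply skip numbers that have no page, such as 03.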

User

Answer 2 · edited 2024-10-04 05:23:52

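Just get all the states from the initial page. You can then select the second table and use the CSS classes odd and results to get the tr you want. Because that class combination is unique, no slicing is needed: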

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin


base = "http://livingwage.mit.edu"
res = requests.get(base)

res.raise_for_status()
states = []
# Get every state's URL and name from the anchor tags on the base page:
# select the anchors inside each li that is a descendant of the
# ul with the CSS classes "states" and "list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before "/locations", so we split once on "/" from
    # the right -> "/states/51" and join that to the base URL.
    # The anchor text also holds the state name, so we store the full URL
    # and the state, i.e. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table. Indexing in CSS starts at 1, so "table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # The row we want is the tr with the CSS classes "odd" and "results".
    # "td + td" matches every td that directly follows another td, which skips
    # the first cell (the row label); we then call .text on each remaining td.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the url and state from each tuple in our states list. 
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

Running the script prints one line per state: the state name followed by the list of values parsed from that row.
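Since the question title also asks about saving the results, one possible way to write the scraped values to a CSV file is sketched below. The function name save_rows and the file layout (state name first, then one column per value) are assumptions for illustration, not part of the original answer.

```python
import csv


def save_rows(rows, path):
    """Write (state, values) pairs to a CSV file, one state per line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for state, values in rows:
            # First column is the state name, the rest are the parsed cells.
            writer.writerow([state, *values])
```

In the script above, the (state, parse(soup)) pairs could be collected into a list inside the loop and passed to save_rows once the loop finishes.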

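You could loop over a range from 1 to 53, but extracting the anchors from the base page also gives us the state name in the same step. Using the h1 from each state page would instead give you "Living Wage Calculation for Alabama", which you would then have to parse to get just the name, and that is not trivial given that some states have names of more than one word.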
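The anchor-based approach depends on turning each href from the base page into a state URL. That step in isolation, using the example href quoted in the code comments, can be sketched as:

```python
from urllib.parse import urljoin

base = "http://livingwage.mit.edu"
href = "/states/51/locations"   # example href, as quoted in the code comments

# Split once from the right: ["/states/51", "locations"]; keep the first part.
trimmed = href.rsplit("/", 1)[0]

# Join the site-relative path onto the base URL.
full_url = urljoin(base, trimmed)
```

Splitting with a maximum of one split from the right drops only the last path segment, so the state number stays intact whatever its value.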
