Python > BeautifulSoup > Web scraping > Loop over URLs (1 to 53) and save the results

Posted 2024-10-04 05:23:52


Here is the website I am trying to scrape: http://livingwage.mit.edu/

The specific URLs run from

http://livingwage.mit.edu/states/01

http://livingwage.mit.edu/states/02

http://livingwage.mit.edu/states/04 (For some reason they skipped 03)

...all the way to...

http://livingwage.mit.edu/states/56

On each URL, I need the last row of the second table:

Example for http://livingwage.mit.edu/states/01

Required annual income before taxes $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Desired output:

Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403

...

...

Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424

After 2 hours of messing around, this is what I have so far (I am a beginner):

^{pr2}$

When I inspect the state name and the rows in the Python console, it gives me the HTML elements

[<h1>Living Wag...Alabama</h1>]

and

[<tr class = "odd...   </td> </tr>]

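Problem 1: These are the things that I want in the desired output, but how can I get Python to give it to me in a string format rather than HTML like above?

Problem 2: How do I loop through the requests.get(url01 to url56)?

Thank you for your help.

I would also really appreciate a more efficient way of getting the rows variable in my code, since the way I get it is not very Pythonic.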


2 Answers

Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?

You can get the text simply by using .text, as in the following line:

state_name=states.find('h1').text

The same approach works for each of the rows as well.
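As a rough illustration of the point above, here is a minimal sketch that pulls plain strings out of a heading and a table row. The HTML snippet is a made-up, simplified stand-in for one state page (the real markup may differ), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one state page.
html = """
<h1>Living Wage Calculation for Alabama</h1>
<table>
  <tr class="odd results">
    <td>Required annual income before taxes</td>
    <td> $20,260 </td>
    <td> $42,786 </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# .find() returns a Tag object; .text gives the string inside it.
state_name = soup.find("h1").text

# Same idea for a row: take the stripped text of every cell.
row = soup.select_one("tr.odd.results")
cells = [td.get_text(strip=True) for td in row.find_all("td")]

print(state_name)
print(cells)
```

Printing a Tag (or a list of Tags) shows the markup, which is why the console displayed HTML; reading .text or calling get_text() is what yields plain strings.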

Problem 2: How do I loop through the request.get(url01 to url56)?

The same block of code can be put inside a loop from 1 to 56, like so:

^{pr2}$

zfill will add those leading zeros. It would also be better to wrap requests.get in a try-except block, so that the loop continues gracefully even when a URL is wrong.
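A minimal sketch of that idea, assuming the URL pattern shown in the question. The helper names state_urls and fetch_all, the get parameter (there only so the function can be exercised without network access), and the 10-second timeout are all illustrative choices, not part of the original answer.

```python
import requests

BASE = "http://livingwage.mit.edu/states/"


def state_urls(first=1, last=56):
    """Yield (code, url) pairs such as ("01", ".../states/01")."""
    for n in range(first, last + 1):
        code = str(n).zfill(2)  # 1 -> "01", 12 -> "12"
        yield code, BASE + code


def fetch_all(get=requests.get):
    """Return {code: html} for every URL that responds successfully."""
    pages = {}
    for code, url in state_urls():
        try:
            res = get(url, timeout=10)
            res.raise_for_status()  # turn 404 and similar into an exception
        except requests.RequestException as exc:
            # A missing number (such as 03) is reported and skipped.
            print(f"skipping {url}: {exc}")
            continue
        pages[code] = res.text
    return pages
```

Calling fetch_all() would then request each page in turn; numbers with no page behind them are skipped instead of stopping the loop.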

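Just get all the states from the initial page. You can then select the second table and use the CSS classes odd results to get the tr you need; the class names are unique, so there is no need to slice: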

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


base = "http://livingwage.mit.edu"
res = requests.get(base)

res.raise_for_status()
states = []
# Get every state URL and state name from the anchor tags on the base page:
# select all the anchors inside each li that is a descendant of the
# ul with the CSS classes "states" and "list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split once on "/" from the right -> "/states/51"
    # and join that to the base URL. The anchor text also holds the state name,
    # so we store the full URL and the state, e.g. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table. Indexing in CSS starts at 1, so "table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # To get the text, we just need to find all the tds and call .text on each.
    # The tr we want has the CSS classes "odd results"; "td + td" starts from the second td,
    # skipping the first, which holds the label *Required annual income before taxes*.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the URL and state name from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code, you will see output like the following:

^{pr2}$
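Since the question also asks about saving the results, here is one possible sketch using the standard csv module. The write_rows helper is hypothetical, and the sample row reuses only the first three Alabama values from the question; in practice the rows would come from the scraping loop.

```python
import csv
import io


def write_rows(rows, fileobj):
    """Write one CSV line per state: the name followed by its values."""
    writer = csv.writer(fileobj)
    for state, values in rows:
        writer.writerow([state, *values])


# Sample data: the first three Alabama values quoted in the question.
rows = [("Alabama", ["$20,260", "$42,786", "$51,642"])]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function could be handed a file opened with open("living_wage.csv", "w", newline=""), where the file name is just an example. The csv writer quotes each dollar amount because the values contain commas.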

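You could loop over a range of 1-53, but extracting the anchors from the base page also gives us the state names in a single step. Using the h1 from each page would instead give you output like "Living Wage Calculation for Alabama", which you would then have to parse to get the name. That is not trivial, considering some states have names of more than one word.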
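To make the parsing difficulty concrete, here is a small sketch. It assumes every heading starts with the prefix "Living Wage Calculation for ", which is inferred from the Alabama example and not verified against every page; state_from_heading is a hypothetical helper.

```python
# Assumed heading prefix, inferred from the Alabama example.
PREFIX = "Living Wage Calculation for "


def state_from_heading(heading):
    """Strip the assumed prefix; return the heading unchanged if it is absent."""
    heading = heading.strip()
    if heading.startswith(PREFIX):
        return heading[len(PREFIX):]
    return heading


heading = "Living Wage Calculation for New Hampshire"

# Naive approach: taking the last word loses part of a multi-word name.
print(heading.split()[-1])

# Prefix stripping keeps the whole name, as long as the assumption holds.
print(state_from_heading(heading))
```

Reading the name from the anchor text on the base page avoids depending on the heading's wording at all.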
