<p>Just grab all the states from the initial page, then you can select the second table and use the <em>css classes</em> <em>odd results</em> to get the <em>tr</em> you need; since the class names are unique, there is no need to slice:</p>
<pre><code>import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin

base = "http://livingwage.mit.edu"
res = requests.get(base)
res.raise_for_status()

states = []
# Get each state url and state name from the anchors on the base page:
# select every anchor inside the li tags that are children of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split once on / from the
    # right -> "/states/51" and join that to the base url. The anchor text
    # holds the state name, so we store (full url, state), e.g.
    # ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))

def parse(soup):
    # Get the second table; css indexing starts at 1, so
    # "table:nth-of-type(2)" selects the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # The row we want has the css classes "odd results". "td + td" starts
    # from the second td, skipping the first one, which holds the label
    # "Required annual income before taxes"; we call .text on each of the rest.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]

# Unpack the url and state from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))
</code></pre>
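<p>To illustrate the url handling above in isolation, here is a minimal, self-contained sketch of how <code>rsplit</code> and <code>urljoin</code> turn an href like <code>/states/51/locations</code> into a state url (the href value is the example from the code comments, used here as a literal so no network access is needed):</p>
<pre><code>from urllib.parse import urljoin

base = "http://livingwage.mit.edu"
href = "/states/51/locations"
# Drop the trailing "/locations" segment with a single split from the right,
# then resolve the remaining path against the base url.
state_path = href.rsplit("/", 1)[0]   # "/states/51"
print(urljoin(base, state_path))      # http://livingwage.mit.edu/states/51
</code></pre>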
<p>Running the code prints each state name followed by the list of values scraped from its table row.</p>
<p>You could also loop over the range 1-53 to build the urls, but pulling the anchors from the base page gives us the state name in the same step. Using the h1 from each state page would instead give you something like "Living Wage Calculation for Alabama", which you would then have to parse to recover the name; that is not trivial considering some states have names of more than one word.</p>
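<p>For completeness, a sketch of that h1 parsing, assuming the heading follows the fixed pattern "Living Wage Calculation for &lt;state&gt;" described above. The heading string below is a hypothetical literal, not fetched from the site:</p>
<pre><code># Hypothetical h1 text; on a real page it would come from something like
# soup.select_one("h1").text.
h1_text = "Living Wage Calculation for New Hampshire"
prefix = "Living Wage Calculation for "
# Strip the fixed prefix; this handles multi-word state names too.
state = h1_text[len(prefix):] if h1_text.startswith(prefix) else h1_text
print(state)  # New Hampshire
</code></pre>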