<p>Here are two approaches.</p>
<pre><code>import mechanize
import requests
from bs4 import BeautifulSoup, SoupStrainer
import urlparse
import pprint

# Mechanize
br = mechanize.Browser()

def get_links_mechanize(root):
    links = []
    br.open(root)
    for link in br.links():
        try:
            if dict(link.attrs)['class'] == 'page':
                links.append(link.absolute_url)
        except KeyError:  # link has no class attribute
            pass
    return links

# Requests / BeautifulSoup / urlparse
def get_links_bs(root):
    links = []
    r = requests.get(root)
    for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')):
        if link.has_attr('href') and link.has_attr('class') and 'page' in link.get('class'):
            links.append(urlparse.urljoin(root, link.get('href')))
    return links

#with open("C:\Users\Administrator\Desktop\\3.txt","r") as f:
#    for root in f:
#        links = get_links(root)
#        # &lt;Do something with links&gt;

root = 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/'
print "Mech:"
pprint.pprint( get_links_mechanize(root) )
print "Requests/BS4/urlparse:"
pprint.pprint( get_links_bs(root) )
</code></pre>
<p>One uses <code>mechanize</code>. It is a bit smarter with URLs, but much slower, and may be overkill depending on what else you're doing.</p>
<p>The other uses <code>requests</code> to fetch the page (<code>urllib2</code> would also suffice), <code>BeautifulSoup</code> to parse the markup, and <code>urlparse</code> to form absolute URLs from the relative URLs on the page you listed.</p>
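<p>To illustrate the <code>urljoin</code> step: a relative reference is resolved against the page it was found on, while an absolute URL passes through unchanged. (The snippet below uses Python 3's <code>urllib.parse</code>, which is where the Python 2 <code>urlparse</code> module's functions now live; the behavior is the same.)</p>
<pre><code>from urllib.parse import urljoin

root = 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/'

# A query-only relative reference keeps the base path and swaps the query:
print(urljoin(root, '?page=2'))
# -&gt; http://www.apartmentguide.com/apartments/Alabama/Hartselle/?page=2

# An absolute URL is returned as-is:
print(urljoin(root, 'http://example.com/other'))
# -&gt; http://example.com/other
</code></pre>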
<p>Note that both of these functions return lists that contain duplicates. You can remove the duplicates by changing</p>
<pre><code>return links
</code></pre>
<p>to</p>
<pre><code>return list(set(links))
</code></pre>
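<p>One caveat: <code>set()</code> does not preserve order, so the deduplicated links may come back shuffled. If the original page order matters, a first-seen dedup is a common alternative (a minimal sketch; the helper name is mine):</p>
<pre><code>def dedupe(links):
    # Keep only the first occurrence of each link, preserving order
    seen = set()
    out = []
    for link in links:
        if link not in seen:
            seen.add(link)
            out.append(link)
    return out

print(dedupe(['a', 'b', 'a', 'c', 'b']))  # ['a', 'b', 'c']
</code></pre>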
<p>in whichever approach you choose.</p>
<p><strong>EDIT:</strong></p>
<p>I noticed that the functions above only return links to pages 2-5 -- you have to navigate through those pages to see that there are actually 10 of them.</p>
<p>A completely different approach would be to scrape the number of results from the "root" page, predict how many pages that will produce, and then build the links from that.</p>
<p>Since there are 20 results per page, figuring out how many pages there are is straightforward. Consider:</p>
<pre><code>import requests, re, math, pprint

def scrape_results(root):
    links = []
    r = requests.get(root)
    mat = re.search(r'We have (\d+) apartments for rent', r.text)
    num_results = int(mat.group(1))  # 182 at the moment
    num_pages = int(math.ceil(num_results/20.0))  # ceil(182/20) =&gt; 10

    # Construct links for pages 1-10
    for i in range(num_pages):
        links.append("%s?page=%d" % (root, (i+1)))
    return links

pprint.pprint(scrape_results(root))
</code></pre>
<p>This would be the fastest of the three approaches, but possibly more error-prone.</p>
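<p>"Error-prone" mostly means the regex: if the site rewords its results banner, <code>re.search</code> returns <code>None</code> and <code>mat.group(1)</code> raises. A defensive variant of the page-count arithmetic might look like this (a sketch; falling back to one page is my assumption):</p>
<pre><code>import math, re

def page_count(html, per_page=20):
    mat = re.search(r'We have (\d+) apartments for rent', html)
    if mat is None:
        return 1  # assume a single page if the banner isn't found
    return int(math.ceil(int(mat.group(1)) / float(per_page)))

print(page_count('We have 182 apartments for rent'))  # 10
print(page_count('some unexpected markup'))           # 1
</code></pre>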
<p><strong>EDIT 2</strong>:</p>
<p>Perhaps something like this:</p>
<pre><code>import re, math, pprint
import requests, urlparse
from bs4 import BeautifulSoup, SoupStrainer

def get_pages(root):
    links = []
    r = requests.get(root)
    mat = re.search(r'We have (\d+) apartments for rent', r.text)
    num_results = int(mat.group(1))  # 182 at the moment
    num_pages = int(math.ceil(num_results/20.0))  # ceil(182/20) =&gt; 10

    # Construct links for pages 1-10
    for i in range(num_pages):
        links.append("%s?page=%d" % (root, (i+1)))
    return links

def get_listings(page):
    links = []
    r = requests.get(page)
    for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')):
        if link.has_attr('href') and link.has_attr('data-listingid') and 'name' in link.get('class'):
            # Resolve relative hrefs against the page they came from
            links.append(urlparse.urljoin(page, link.get('href')))
    return links

root = 'http://www.apartmentguide.com/apartments/Alabama/Hartselle/'
listings = []
for page in get_pages(root):
    listings += get_listings(page)

pprint.pprint(listings)
print(len(listings))
</code></pre>