<p><a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags">Do not parse HTML with regex.</a> Use a dedicated tool - an <code>HTML parser</code>.</p>
<p>Here is a solution using <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer"><code>BeautifulSoup</code></a>:</p>
<pre><code>import urllib.request

from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in range(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib.request.urlopen(url), "html.parser")

    print("Page Number: %s" % page)

    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print(h2.text)
</code></pre>
<p>It prints:</p>
<pre><code>Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...
</code></pre>
<p>As you can see, first we find a <code>table</code> tag with the <code>results</code> class - this is where the store names actually live. Then, inside that <code>table</code>, we find all of the <code>h2</code> tags. This is more robust than relying on the tags' <code>style</code> attributes.</p>
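<p>A variation not in the original answer: the same two-step lookup can be expressed as a single CSS selector via <code>select()</code>. A minimal self-contained sketch, using a made-up HTML snippet in place of the live page:</p>
<pre><code># Sketch: extract h2 tags inside table.results with one CSS selector
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the store-results table
html = """
&lt;table class="results"&gt;
  &lt;tr&gt;&lt;td&gt;&lt;h2&gt;Boost Mobile Store by Wireless Depot&lt;/h2&gt;&lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;&lt;h2&gt;Target&lt;/h2&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
"""
soup = BeautifulSoup(html, "html.parser")

# "table.results h2" matches every h2 inside a table with class "results"
names = [h2.text for h2 in soup.select("table.results h2")]
print(names)
</code></pre>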
<hr/>
<p>You can also make use of <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document" rel="nofollow noreferrer"><code>SoupStrainer</code></a>. It will improve performance, since it parses only the part of the document you specify:</p>
<pre><code>from bs4 import BeautifulSoup, SoupStrainer

required_part = SoupStrainer('table', class_="results")

for page in range(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib.request.urlopen(url), "html.parser",
                         parse_only=required_part)

    print("Page Number: %s" % page)

    for h2 in soup.find_all('h2'):
        print(h2.text)
</code></pre>
<p>Here we are saying: "parse only the <code>table</code> tag with the class <code>results</code>, and give us all of the <code>h2</code> tags inside it."</p>
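<p>To see the filtering in action without fetching the live page, here is a minimal sketch with an invented snippet: an <code>h2</code> outside <code>table.results</code> never makes it into the soup at all:</p>
<pre><code># Sketch: SoupStrainer drops everything outside table.results at parse time
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup: one h2 outside the results table, one inside
html = """
&lt;div&gt;&lt;h2&gt;Not a store - outside the table&lt;/h2&gt;&lt;/div&gt;
&lt;table class="results"&gt;&lt;tr&gt;&lt;td&gt;&lt;h2&gt;KOB Wireless&lt;/h2&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
"""
only_results = SoupStrainer('table', class_="results")
soup = BeautifulSoup(html, "html.parser", parse_only=only_results)

# Only the h2 inside table.results was parsed
print([h2.text for h2 in soup.find_all('h2')])
</code></pre>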
<p>Also, if you want to improve performance, you can <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser" rel="nofollow noreferrer">let <code>BeautifulSoup</code> use the <code>lxml</code> parser under the hood</a>:</p>
<pre><code>soup = BeautifulSoup(urllib.request.urlopen(url), "lxml", parse_only=required_part)
</code></pre>
<p>Hope that helps.</p>