<pre><code>from BeautifulSoup import BeautifulSoup
import re

html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http:/amazon.com">Link 2</a>
"""

def processor(tag):
    # Accept only tags that have an href which does not contain "google"
    href = tag.get('href')
    if not href:
        return False
    return "google" not in href

soup = BeautifulSoup(html)
# A tag is returned only if processor(tag) is true and its href starts with "http"
back_links = soup.findAll(processor, href=re.compile(r"^http"))
print back_links

# output:
# [<a href="http:/amazon.com">Link 2</a>]
</code></pre>
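<p>However, it may be more efficient to first fetch every link whose href starts with http, and then search those links for the ones that do not have "google" in the href.</p>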
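<p>A minimal sketch of that two-step approach follows. It is an illustration, not the original snippet. It assumes Python 3 and the <code>bs4</code> package (with <code>find_all</code>) rather than the BeautifulSoup 3 module used above, and the helper name <code>non_google_http_links</code> is hypothetical.</p>

```python
import re

# Assumption: bs4 (BeautifulSoup 4), not the BeautifulSoup 3 module used above.
from bs4 import BeautifulSoup

html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http:/amazon.com">Link 2</a>
"""


def non_google_http_links(markup):
    """Hypothetical helper: <a> tags whose href starts with "http"
    and does not contain "google"."""
    soup = BeautifulSoup(markup, "html.parser")
    # Step 1: let the parser select every link whose href starts with "http".
    http_links = soup.find_all("a", href=re.compile(r"^http"))
    # Step 2: filter the (usually much smaller) result list in plain Python.
    return [a for a in http_links if "google" not in a["href"]]


back_links = non_google_http_links(html)
print(back_links)
```

<p>Here the regular expression narrows the candidates first, so the "google" check only runs on tags that already matched, instead of calling a function on every tag in the document.</p>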