<p>A solution based on BeautifulSoup:</p>
<pre><code>from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

site = "http://en.wikipedia.org/wiki/Aldi"
# Send a browser-like User-Agent header with the request
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page.read(), 'html.parser')
table = soup.find('table', class_='infobox vcard')

result = {}
exceptional_row_count = 0
for tr in table.find_all('tr'):
    th, td = tr.find('th'), tr.find('td')
    if th is not None and td is not None:
        # normal row: header cell is the key, data cell is the value
        result[th.text] = td.text
    else:
        # rows without a th/td pair (e.g. the first row, the logo) fall here
        exceptional_row_count += 1
if exceptional_row_count > 1:
    print('WARNING ExceptionalRow>1: ', table)
print(result)
</code></pre>
<p>Tested on <a href="http://en.wikipedia.org/wiki/Aldi" rel="nofollow">http://en.wikipedia.org/wiki/Aldi</a>, but not fully tested on other Wikipedia pages.</p>
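<p>Since the snippet above was only checked against one page, a slightly more defensive variant may be easier to try on others. The following is a minimal sketch for Python 3, not a tested general solution: the function name <code>parse_infobox</code> and the sample HTML are made up for illustration, and it assumes the target table carries the class <code>infobox</code>. It returns an empty dict when no such table is found and skips rows that lack either a <code>th</code> or a <code>td</code>.</p>

```python
from bs4 import BeautifulSoup


def parse_infobox(html):
    """Return {header text: cell text} for the first infobox table in html.

    Hypothetical helper for illustration; assumes the table has class 'infobox'.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Matching on the single class 'infobox' also accepts 'infobox vcard' etc.
    table = soup.find('table', class_='infobox')
    if table is None:
        return {}
    result = {}
    for tr in table.find_all('tr'):
        th, td = tr.find('th'), tr.find('td')
        if th is None or td is None:
            # logo rows, section headers and similar rows are skipped
            continue
        result[th.get_text(strip=True)] = td.get_text(' ', strip=True)
    return result


# Made-up sample markup, not taken from any real page
sample = """
<table class="infobox vcard">
<tr><td colspan="2">logo</td></tr>
<tr><th>Name</th><td>Example Co</td></tr>
<tr><th>Founded</th><td>2000</td></tr>
</table>
"""
print(parse_infobox(sample))
```

<p>Taking the HTML as a string keeps the parsing separate from the download, so the same function can be fed the result of <code>page.read()</code> from the code above or a saved copy of a page.</p>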