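<p>The site appears to be dynamic: a quick look at the page source shows that the table itself is not present in the served HTML, so it is presumably built by JavaScript after the page loads. You therefore need a browser automation tool such as <code>selenium</code>:</p>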
<pre><code>import re
from collections import namedtuple

from bs4 import BeautifulSoup as soup
from selenium import webdriver

# Path to a local chromedriver binary; adjust for your machine
d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores')

def page_results(html):
    school = namedtuple('school', ['ranking', 'name', 'location', 'scores'])
    page = soup(html, 'lxml')  # parse the rendered HTML once and reuse it
    rankings = [i.text for i in page.find_all('td', {'class': 'rank sorting_1 sorting_2'})]
    names = [i.text for i in page.find_all('a', {'class': 'ranking-institution-title'})]
    locations = [i.text for i in page.find_all('div', {'class': 'location'})]
    # Score cells have classes of the form "scores &lt;something&gt;-score"
    full_scores = [i.text for i in page.find_all('td', {'class': re.compile(r'scores\s+[\w_]+-score')})]
    # Every table row contributes six consecutive score cells
    final_scores = [dict(zip(['overall', 'teaching', 'research', 'citations', 'income', 'outlook'], full_scores[i:i+6])) for i in range(0, len(full_scores), 6)]
    return [school(*i) for i in zip(rankings, names, locations, final_scores)]

# Scrape the first page, then follow the numbered pagination links
pages = [page_results(d.page_source)]
links = d.find_elements_by_tag_name('a')
for link in links:
    # Pagination links are the anchors whose text is purely numeric
    if link.text.isdigit():
        try:
            link.click()
            pages.append(page_results(d.page_source))
        except Exception:
            # Skip links that cannot be clicked (for example hidden or stale elements)
            pass
</code></pre>
<p>Sample output:</p>
^{pr2}$
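<p>The grouping step inside <code>page_results</code> can be tried without a browser. The following is a minimal sketch, not part of the original answer: the helper names <code>group_scores</code> and <code>build_rows</code> are hypothetical, and the input values are placeholders rather than real ranking data. It assumes, as the code above does, that each table row yields exactly six score cells in a fixed order.</p>

```python
from collections import namedtuple

# Same record layout as the namedtuple used in page_results
School = namedtuple('School', ['ranking', 'name', 'location', 'scores'])

SCORE_KEYS = ['overall', 'teaching', 'research', 'citations', 'income', 'outlook']


def group_scores(flat_scores, keys=SCORE_KEYS):
    """Split a flat list of score cells into one dict per table row.

    Hypothetical helper: assumes every row contributes len(keys) cells.
    """
    step = len(keys)
    return [dict(zip(keys, flat_scores[i:i + step]))
            for i in range(0, len(flat_scores), step)]


def build_rows(rankings, names, locations, flat_scores):
    """Zip the per-column lists into one School record per row."""
    return [School(*row)
            for row in zip(rankings, names, locations, group_scores(flat_scores))]


# Placeholder inputs standing in for the text scraped from two table rows
rows = build_rows(
    ['1', '2'],
    ['School A', 'School B'],
    ['Country X', 'Country Y'],
    [str(n) for n in range(12)],  # twelve cells: six per row
)

for row in rows:
    print(row.ranking, row.name, row.scores['overall'])
```

<p>Because <code>zip</code> stops at the shortest input, a page where one of the column lists comes back shorter than the others silently drops rows, so it is worth comparing the list lengths when the results look incomplete.</p>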