Scraping a table from the web with Python

Published 2024-10-02 20:38:50


I am trying to get the whole table (all 1000+ universities) from this site: https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores

For this I used the requests and BeautifulSoup libraries; my code is:

import requests
from bs4 import BeautifulSoup

html_content = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
soup = BeautifulSoup(html_content.text, 'lxml')

When I then search the parsed document for the table, I cannot see the table body <tbody>, the rows <tr>, or the columns <td> in the result.


Please help me fetch all the information from this site and build a dataframe from it.


2 answers

Try the approach below. If you look at the network activity in the XHR section of the Network tab in your browser's devtools, you can find the URL of the JSON file the page loads its data from. A script that pulls the data out of that JSON response looks like this:

import requests

# the JSON endpoint found in the XHR section of the Network tab in devtools
URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

res = requests.get(URL)
# each item in 'data' is one university record
for items in res.json()['data']:
    rank = items['rank']
    name = items['name']
    intstudents = items['stats_pc_intl_students']
    ratio = items['stats_female_male_ratio']
    print(rank, name, intstudents, ratio)

Output: one line per university with rank, name, international-student percentage, and female : male ratio.
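Since the end goal is a DataFrame, the same JSON records can be loaded straight into pandas. A minimal sketch, using a hand-made two-row sample in place of the live response (the field names match the answer above; the values are illustrative only):

```python
import pandas as pd

# sample records mimicking the structure of the JSON endpoint's 'data' list
data = [
    {"rank": "1", "name": "University of Oxford",
     "stats_pc_intl_students": "38%", "stats_female_male_ratio": "46 : 54"},
    {"rank": "2", "name": "University of Cambridge",
     "stats_pc_intl_students": "35%", "stats_female_male_ratio": "45 : 55"},
]

# a list of flat dicts maps directly onto DataFrame rows
df = pd.DataFrame(data)
df = df[['rank', 'name', 'stats_pc_intl_students', 'stats_female_male_ratio']]
print(df)
```

With the live endpoint, `data = requests.get(URL).json()['data']` would replace the sample list.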

The site appears to be dynamic: a quick look at the page source shows that the table itself is not rendered in the static DOM. You therefore need a browser-automation tool such as Selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re
from collections import namedtuple

d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores')

def page_results(html):
    """Parse one rendered page of the rankings table."""
    school = namedtuple('school', ['ranking', 'name', 'location', 'scores'])
    page = soup(html, 'lxml')
    rankings = [i.text for i in page.find_all('td', {'class': 'rank sorting_1 sorting_2'})]
    names = [i.text for i in page.find_all('a', {'class': 'ranking-institution-title'})]
    locations = [i.text for i in page.find_all('div', {'class': 'location'})]
    full_scores = [i.text for i in page.find_all('td', {'class': re.compile(r'scores\s+[\w_]+-score')})]
    # every six consecutive score cells belong to one university
    final_scores = [dict(zip(['overall', 'teaching', 'research', 'citations', 'income', 'outlook'],
                             full_scores[i:i + 6]))
                    for i in range(0, len(full_scores), 6)]
    return [school(*i) for i in zip(rankings, names, locations, final_scores)]

pages = [page_results(d.page_source)]
# click through the numbered pagination links and scrape each rendered page
links = d.find_elements_by_tag_name('a')
for link in links:
    if link.text.isdigit():
        try:
            link.click()
            pages.append(page_results(d.page_source))
        except Exception:
            pass

Sample output: each page yields a list of school namedtuples (ranking, name, location, and a dict of scores).
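The scraped pages can then be flattened into the DataFrame the question asks for. A sketch using a hand-made sample record in the shape `page_results` returns (the values are illustrative only, since running the Selenium code needs a live browser):

```python
import pandas as pd
from collections import namedtuple

school = namedtuple('school', ['ranking', 'name', 'location', 'scores'])

# sample: a list of pages, each a list of school namedtuples
pages = [[
    school('1', 'University of Oxford', 'United Kingdom',
           {'overall': '94.3', 'teaching': '86.7', 'research': '99.5',
            'citations': '99.1', 'income': '63.7', 'outlook': '95.0'}),
]]

# flatten pages -> rows, converting each namedtuple to a dict
rows = [s._asdict() for page in pages for s in page]
df = pd.DataFrame(rows)
# expand the nested scores dict into one column per score
df = pd.concat([df.drop(columns='scores'), df['scores'].apply(pd.Series)], axis=1)
print(df)
```

This keeps one row per university, with the six score components as separate columns.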
