Scraping a table from the web with Python

Published 2024-10-02 20:38:50


I am trying to get the whole table (all 1000+ universities) from this site: https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores

For this I used the requests and BeautifulSoup libraries; my code is:

import requests
from bs4 import BeautifulSoup

html_content = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
soup = BeautifulSoup(html_content.text, 'lxml')

When I then search the parsed document for the table, I cannot see the table body <tbody>, the rows <tr>, or the columns <td> in the result.


Please help me fetch all the information from this site and build a dataframe from it.


2 answers

Try the approach below. If you look at the network activity in the XHR section of the Network tab in your browser's devtools, you can find the URL of the JSON file the page loads its data from. A script that pulls the data out of that JSON response looks like this:

import requests

# the JSON endpoint found in the XHR section of the Network tab in devtools
URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

res = requests.get(URL)
# each item in 'data' is one university record
for items in res.json()['data']:
    rank = items['rank']
    name = items['name']
    intstudents = items['stats_pc_intl_students']
    ratio = items['stats_female_male_ratio']
    print(rank, name, intstudents, ratio)

Output: one line per university with rank, name, international-student percentage, and female : male ratio.
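Since the end goal is a DataFrame, the same JSON records can be loaded straight into pandas. A minimal sketch, using a hand-made two-row sample in place of the live response (the field names match the answer above; the values are illustrative only):

```python
import pandas as pd

# sample records mimicking the structure of the JSON endpoint's 'data' list
data = [
    {"rank": "1", "name": "University of Oxford",
     "stats_pc_intl_students": "38%", "stats_female_male_ratio": "46 : 54"},
    {"rank": "2", "name": "University of Cambridge",
     "stats_pc_intl_students": "35%", "stats_female_male_ratio": "45 : 55"},
]

# a list of flat dicts maps directly onto DataFrame rows
df = pd.DataFrame(data)
df = df[['rank', 'name', 'stats_pc_intl_students', 'stats_female_male_ratio']]
print(df)
```

With the live endpoint, `data = requests.get(URL).json()['data']` would replace the sample list.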

The site appears to be dynamic: a quick look at the page source shows that the table itself is not rendered in the static DOM. You therefore need a browser-automation tool such as Selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re
from collections import namedtuple

d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores')

def page_results(html):
    """Parse one rendered page of the rankings table."""
    school = namedtuple('school', ['ranking', 'name', 'location', 'scores'])
    page = soup(html, 'lxml')
    rankings = [i.text for i in page.find_all('td', {'class': 'rank sorting_1 sorting_2'})]
    names = [i.text for i in page.find_all('a', {'class': 'ranking-institution-title'})]
    locations = [i.text for i in page.find_all('div', {'class': 'location'})]
    full_scores = [i.text for i in page.find_all('td', {'class': re.compile(r'scores\s+[\w_]+-score')})]
    # every six consecutive score cells belong to one university
    final_scores = [dict(zip(['overall', 'teaching', 'research', 'citations', 'income', 'outlook'],
                             full_scores[i:i + 6]))
                    for i in range(0, len(full_scores), 6)]
    return [school(*i) for i in zip(rankings, names, locations, final_scores)]

pages = [page_results(d.page_source)]
# click through the numbered pagination links and scrape each rendered page
links = d.find_elements_by_tag_name('a')
for link in links:
    if link.text.isdigit():
        try:
            link.click()
            pages.append(page_results(d.page_source))
        except Exception:
            pass

Sample output: each page yields a list of school namedtuples (ranking, name, location, and a dict of scores).
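The scraped pages can then be flattened into the DataFrame the question asks for. A sketch using a hand-made sample record in the shape `page_results` returns (the values are illustrative only, since running the Selenium code needs a live browser):

```python
import pandas as pd
from collections import namedtuple

school = namedtuple('school', ['ranking', 'name', 'location', 'scores'])

# sample: a list of pages, each a list of school namedtuples
pages = [[
    school('1', 'University of Oxford', 'United Kingdom',
           {'overall': '94.3', 'teaching': '86.7', 'research': '99.5',
            'citations': '99.1', 'income': '63.7', 'outlook': '95.0'}),
]]

# flatten pages -> rows, converting each namedtuple to a dict
rows = [s._asdict() for page in pages for s in page]
df = pd.DataFrame(rows)
# expand the nested scores dict into one column per score
df = pd.concat([df.drop(columns='scores'), df['scores'].apply(pd.Series)], axis=1)
print(df)
```

This keeps one row per university, with the six score components as separate columns.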
