How to scrape the full table hidden behind "Show all" with Python

Published 2024-10-02 20:30:16


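Based on the answers to my previous question, I can scrape the table at https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html with Python, but it only gets the part of the table up to the "Show all" row.

How can I get the full table hidden behind "Show all" in Python?

Here is the code I am using: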

import pandas as pd
import requests
from bs4 import BeautifulSoup
#
vaccineDF = pd.read_html('https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html')[0]
vaccineDF = vaccineDF.reset_index(drop=True)
print(vaccineDF.head(100))

The output contains only the first 15 rows, followed by a "Show all" row:

   Unnamed: 0_level_0 Doses administered  ... Unnamed: 8_level_0 Unnamed: 9_level_0
   Unnamed: 0_level_1     Per 100 people  ... Unnamed: 8_level_1 Unnamed: 9_level_1
0               World                 11  ...                NaN                NaN
1              Israel                116  ...                NaN                NaN
2          Seychelles                116  ...                NaN                NaN
3              U.A.E.                 99  ...                NaN                NaN
4               Chile                 69  ...                NaN                NaN
5             Bahrain                 66  ...                NaN                NaN
6              Bhutan                 63  ...                NaN                NaN
7                U.K.                 62  ...                NaN                NaN
8       United States                 61  ...                NaN                NaN
9          San Marino                 60  ...                NaN                NaN
10           Maldives                 59  ...                NaN                NaN
11              Malta                 55  ...                NaN                NaN
12             Monaco                 53  ...                NaN                NaN
13            Hungary                 45  ...                NaN                NaN
14             Serbia                 44  ...                NaN                NaN
15           Show all           Show all  ...           Show all           Show all

Below is a screenshot of the partial table, up to "Show all", as shown on the web page (left) and in the element inspector (right).


2 Answers
  • OWID publishes this data, which effectively comes from JHU
  • If you need the latest vaccination data by country, using the CSV interface is straightforward
import io

import pandas as pd
import requests

# Download the OWID vaccinations CSV and parse the dates
dfraw = pd.read_csv(io.StringIO(requests.get("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv").text))
dfraw["date"] = pd.to_datetime(dfraw["date"])

# Keep the most recent row for each country
dfraw.sort_values(["iso_code","date"]).groupby("iso_code", as_index=False).last()
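
To make the "latest row per country" step easier to check without downloading anything, here is a minimal sketch on made-up data. The `iso_code` and `date` columns match the code above; the `toy` frame, its values, and the `total_vaccinations` column are assumptions used only for illustration.

```python
import pandas as pd

# Toy data standing in for the OWID CSV (all values are made up).
toy = pd.DataFrame(
    {
        "iso_code": ["AAA", "AAA", "BBB", "BBB"],
        "date": ["2021-04-01", "2021-04-03", "2021-04-02", "2021-04-01"],
        "total_vaccinations": [100, 300, 50, 20],
    }
)
toy["date"] = pd.to_datetime(toy["date"])

# Sort by country and date, then keep the last (most recent) row per country.
latest = (
    toy.sort_values(["iso_code", "date"])
    .groupby("iso_code", as_index=False)
    .last()
)
print(latest)
```

Note that `last()` returns the last non-null value in each column, so when the newest row has a gap, the value for that column can come from an earlier date.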

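You cannot read the whole table directly, because the complete data only appears after the Show all button is clicked. So we first have to trigger a click on the Show all button, and only then can we fetch the whole table.

I used the Selenium library to click the Show all button. For this scenario I used Selenium's Firefox() WebDriver to load the URL and get all the data. The code below fetches the whole table from the given COVID dataset URL: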

# Import the required libraries
from selenium import webdriver  # drives the browser and performs the click
from selenium.webdriver.common.by import By  # locator strategies for find_element
from pandas.io.html import read_html  # parses HTML tables into DataFrames
import pandas as pd  # used later to reload the CSV file

# Create the Firefox WebDriver object
driver = webdriver.Firefox()

# Load the page
driver.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")

# Find the 'Show all' row by XPath (find_element_by_xpath was removed in Selenium 4.3)
show_all_button = driver.find_element(By.XPATH, "/html/body/div[1]/main/article/section/div/div/div[4]/div[1]/div/table/tbody/tr[16]")

# Click the 'Show all' row
show_all_button.click()

# Get the HTML content of the page after the click
html_data = driver.page_source

# Close the browser once the HTML has been captured
driver.quit()

After fetching the page, let's see how many tables the COVID dataset URL contains

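from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string

covid_data_tables = read_html(StringIO(html_data), attrs = {"class":"g-summary-table  svelte-2wimac"}, header = None)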

# Print Number of Tables Extracted
print ("\nExtracted {num} COVID Data Table".format(num = len(covid_data_tables)), "\n")
# Output of Above Cell:-

Extracted 1 COVID Data Table

Now let's look at the data table:

# Print Table Data

covid_data_tables[0].head(20)
# Output of above cell:-
Unnamed: 0_level_0      Doses administered         Pct. of population
Unnamed: 0_level_1      Per 100 people  Total      Vaccinated   Fully vaccinated
0   World               11              877933955  –            –
1   Israel              116             10307583   60%          56%
2   Seychelles          116             112194     68%          47%
3   U.A.E.              99              9489684    –            –
4   Chile               69              12934282   41%          28%
5   Bahrain             66              1042463    37%          29%
6   Bhutan              63              478219     63%          –
7   U.K.                62              41505768   49%          13%
8   United States       61              202282923  38%          24%
9   San Marino          60              20424      35%          25%
10  Maldives            59              303752     53%          5.6%
11  Malta               55              264658     38%          17%
12  Monaco              53              20510      30%          23%
13  Hungary             45              4416581    32%          14%
14  Serbia              44              3041740    26%          17%
15  Qatar               43              1209648    –            –
16  Uruguay             38              1310591    30%          8.3%
17  Singapore           30              1667522    20%          9.5%
18  Antigua and Barbuda 28              27032      28%          –
19  Iceland             28              98672      20%          8.1%
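
A quick way to confirm that a scraped table is no longer truncated is to look for a row made up entirely of the "Show all" text. The sketch below is an illustration only: `has_show_all_row` is a hypothetical helper, and the two small frames are hand-built from values in the outputs above rather than scraped.

```python
import pandas as pd


def has_show_all_row(df, marker="Show all"):
    """Return True if any row consists entirely of the marker text."""
    return bool(df.astype(str).eq(marker).all(axis=1).any())


# Hand-built stand-ins for the truncated and the complete table.
truncated = pd.DataFrame(
    {
        "country": ["World", "Israel", "Show all"],
        "per_100": ["11", "116", "Show all"],
    }
)
full = truncated.iloc[:2]

print(has_show_all_row(truncated))  # True
print(has_show_all_row(full))       # False
```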

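As you can see, the Show all row no longer appears in our data. read_html already returns each table as a DataFrame; here I additionally save the table in CSV format and reload it for further analysis. The code is as follows: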

# HTML Table to CSV Format Conversion For COVID Dataset
covid_data_file = 'covid_data.csv'
covid_data_tables[0].to_csv(covid_data_file, sep = ',')

# Read CSV Data From Data Table for Further Analysis
covid_data = pd.read_csv("covid_data.csv")

So, after storing all the data in CSV format, let's load it into a DataFrame and print the whole dataset:

# read_csv already returned a DataFrame; drop the index column that to_csv wrote
vaccineDF = covid_data.drop(columns=["Unnamed: 0"])

# Print the whole dataset
vaccineDF
# Output of above cell:-
    Unnamed: 0_level_0  Doses administered  Doses administered.1    Pct. of population  Pct. of population.1
0   Unnamed: 0_level_1  Per 100 people      Total                   Vaccinated          Fully vaccinated
1   World               11                  877933955               –                   – 
2   Israel              116                 10307583                60%                 56%
3   Seychelles          116                 112194                  68%                 47%
4   U.A.E.              99                  9489684                 –                   –
... ...                 ...                 ...                     ...                 ...
154 Syria               <0.1                2500                    <0.1%               –
155 Papua New Guinea    <0.1                1081                    <0.1%               –
156 South Sudan         <0.1                947                     <0.1%               –
157 Cameroon            <0.1                400                     <0.1%               –
158 Zambia              <0.1                106                     <0.1%               –

159 rows × 5 columns
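
In the output above, the second header level ends up as data row 0 after the CSV round trip. As a possible alternative, the two-level header can be flattened directly on the DataFrame. This is a minimal sketch, not part of the original answer: the toy frame copies a few values from the output above, and the `flatten` helper, the `" - "` separator, and the `"Country"` label are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame with the same two-level header shape as the scraped table.
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 0_level_0", "Unnamed: 0_level_1"),
        ("Doses administered", "Per 100 people"),
        ("Doses administered", "Total"),
    ]
)
table = pd.DataFrame(
    [["World", 11, 877933955], ["Israel", 116, 10307583]],
    columns=columns,
)


def flatten(col):
    """Join the header levels, skipping pandas' 'Unnamed: ...' placeholders."""
    parts = [part for part in col if not part.startswith("Unnamed:")]
    # "Country" is an assumed label for the otherwise unnamed first column.
    return " - ".join(parts) if parts else "Country"


flat = table.copy()
flat.columns = [flatten(col) for col in table.columns]
print(flat.columns.tolist())
```

With real data, `table` would be `covid_data_tables[0]`, and the data rows would stay free of header text.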

As the output above shows, we have successfully fetched the whole data table. I hope this solution helps.
