BeautifulSoup cannot parse the content because the page loads it afterwards

Published 2024-06-03 01:14:04


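I have been trying to parse a table from the URL http://podaac.jpl.nasa.gov/ws/search/granule/index.html, but the table is so large that it takes a few milliseconds to load after the page itself has loaded. Because BeautifulSoup only sees the first response the site returns, it cannot get the full table, only the table's header.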

from bs4 import BeautifulSoup as bs
import requests

datasetIds = []
html = requests.get('http://podaac.jpl.nasa.gov/ws/search/granule/index.html')
soup = bs(html.text, 'html.parser')

# The table is present in the static HTML, but only with its header row.
table = soup.find("table", {"id": "tblDataset"})
print(table)
rows = table.find_all('tr')[1:]  # skip the header row

for row in rows:
    cells = row.find_all('td')
    datasetIds.append(cells[1].text)  # second column is expected to hold the dataset ID

print(datasetIds)

The code should return the dataset IDs from the first table, but it only returns the table's header. Thanks in advance for your help! :)


2 Answers

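As @Hassan Mehmood mentioned, you have to use Selenium (or any other headless browser) for this, because the table is generated by JavaScript. BeautifulSoup does not execute JavaScript, so on its own it cannot get the data you need.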
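To see why the original approach finds only the header, it can help to look at what a plain HTTP client receives. The sketch below uses only the standard library's `html.parser` on made-up markup. The two HTML strings, the column headers, and the `RowCounter` class are assumptions for illustration, not the real page source. Before any script runs, the table holds just its header row.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements that appear inside <table id="tblDataset">."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == "tblDataset":
            self.in_table = True
        elif tag == "tr" and self.in_table:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False


def count_rows(html):
    """Return the number of rows found in the tblDataset table."""
    parser = RowCounter()
    parser.feed(html)
    return parser.rows


# Hypothetical markup as served, before any JavaScript has run.
STATIC_HTML = """
<table id="tblDataset">
  <tr><th>#</th><th>Dataset ID</th><th>Short Name</th></tr>
</table>
<script>/* rows are appended here in the browser */</script>
"""

# Hypothetical markup after a browser has executed the script.
RENDERED_HTML = STATIC_HTML.replace(
    "</table>",
    "<tr><td>1</td><td>PODAAC-MODST-M8D9N</td>"
    "<td>MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME</td></tr></table>",
)

print(count_rows(STATIC_HTML))    # rows visible to requests + an HTML parser
print(count_rows(RENDERED_HTML))  # rows visible once the script has run
```

A parser fed the served markup can only ever report the first count, which is why a browser engine (or the underlying data request) is needed.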

You can use this as a starting point:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

log = logging.getLogger(__name__)

# Keep third-party loggers quiet.
logging.getLogger('selenium').setLevel(logging.WARNING)
logging.getLogger('requests').setLevel(logging.WARNING)


def test():
    url = 'http://podaac.jpl.nasa.gov/ws/search/granule/index.html'
    wait_for_element = 30  # seconds to wait for the table to appear

    # PhantomJS support was removed in Selenium 4; with a newer Selenium,
    # substitute a headless Chrome or Firefox driver here.
    s = webdriver.PhantomJS()
    try:
        s.set_window_size(1274, 826)
        s.set_page_load_timeout(45)
        s.get(url)

        # Block until JavaScript has inserted at least one matching element.
        WebDriverWait(s, wait_for_element).until(
            EC.presence_of_element_located((By.CLASS_NAME, "detailTABLE")))

        # find_elements(By..., ...) is available in both Selenium 3 and 4.
        datasets = s.find_elements(By.CLASS_NAME, "detailTABLE")

        for item in datasets:
            print(item.text)
    finally:
        s.quit()  # always shut the browser process down

if __name__ == '__main__':
    test()

The data is retrieved through an Ajax request, so you can do a GET directly against the endpoint, which returns nicely formatted JSON:

import requests

# Note: the name `json` holds the parsed response here, not the stdlib module.
json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/?q=*:*&fl=Dataset-PersistentId,Dataset-ShortName-Full&rows=2147483647&fq=DatasetPolicy-AccessType-Full:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+DatasetPolicy-ViewOnline:Y&wt=json").json()
print(json)

We only need to pull out a couple of keys:

(code snippet missing in the source)
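A minimal sketch of pulling those keys out, assuming the Solr-style layout (`"response"` then `"docs"`) that the loop later in this answer relies on. The sample payload is trimmed from the output shown in this answer, and `extract_docs` is a hypothetical helper name, not part of the original code.

```python
from pprint import pprint

# Sample payload trimmed from this answer's output; in practice it would
# come from requests.get(...).json().
payload = {
    "response": {
        "docs": [
            {"Dataset-PersistentId": "PODAAC-MODST-M8D9N",
             "Dataset-ShortName-Full": "MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME"},
            {"Dataset-PersistentId": "PODAAC-GHMTG-2PN01",
             "Dataset-ShortName-Full": "NAVO-L2P-AVHRRMTA_G"},
        ]
    }
}


def extract_docs(data):
    """Return the list of dataset dicts, or an empty list if the keys are absent."""
    return data.get("response", {}).get("docs", [])


docs = extract_docs(payload)
pprint(docs)
```

Using `.get` with defaults means an unexpected or empty response yields an empty list rather than a `KeyError`.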

Output snippet:

[{'Dataset-PersistentId': 'PODAAC-MODST-M8D9N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODST-MAN4N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-MMO9N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODST-M1D9N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-GHMTG-2PN01',
  'Dataset-ShortName-Full': 'NAVO-L2P-AVHRRMTA_G'},
 {'Dataset-PersistentId': 'PODAAC-GHBDM-4FD01',
  'Dataset-ShortName-Full': 'DMI-L4UHfnd-NSEABALTIC-DMI_OI'},
 {'Dataset-PersistentId': 'PODAAC-GHGOY-4FE01',
  'Dataset-ShortName-Full': 'EUR-L4HRfnd-GLOB-ODYSSEA'},
 {'Dataset-PersistentId': 'PODAAC-GHMED-4FE01',
  'Dataset-ShortName-Full': 'EUR-L4UHFnd-MED-v01'},
 {'Dataset-PersistentId': 'PODAAC-NSGDR-L2X02',
  'Dataset-ShortName-Full': 'NSCAT_LEVEL_2_V2'},
 {'Dataset-PersistentId': 'PODAAC-MODST-M1D4N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-MMO4N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODST-MMO4N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-MAN9N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_ANNUAL_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-M8D4N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_8DAY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-M1D4N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-GOES3-24HOR',
  'Dataset-ShortName-Full': 'GOES_L3_SST_6km_NRT_SST_24HOUR'},

This gives you every dataset ID / short name pair from the table, with no need for bs4.

To get the IDs, just access each dict with the key Dataset-PersistentId:

for d in json["response"]["docs"]:
    print("ID for {Dataset-ShortName-Full} is {Dataset-PersistentId}".format(**d))

Some of the output:

ID for OSTM_L2_OST_OGDR_GPS is PODAAC-J2ODR-GPS00
ID for JPL-L4UHblend-NCAMERICA-RTO_SST_Ad is PODAAC-GHRAD-4FJ01
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-SEABY-ANBIM
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-SEABY-ANBML
ID for CCMP_MEASURES_ATLAS_L4_OW_L3_5A_5DAY_WIND_VECTORS_FLK is PODAAC-CCF35-01AD5
ID for QSCAT_BYU_L3_OW_SIGMA0_ARCTIC_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-QSBYU-ARBML
ID for MODIS_AQUA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME is PODAAC-MODSA-MAN4N
ID for UCLA_DEALIASED_SASS_L3 is PODAAC-SASSX-L3UCD
ID for NSCAT_LEVEL_1.7_V2 is PODAAC-NSSDR-17X02
ID for NSCAT_LEVEL_3_V2 is PODAAC-NSJPL-L3X02
ID for AVHRR_NAVOCEANO_L3_18km_MCSST_DAYTIME is PODAAC-NAVOC-318DY
ID for QSCAT_L3_OW_JPL_BROWSE_IMAGES is PODAAC-QSXXX-L3BI0
ID for QSCAT_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-QSBYU-ANBIM
ID for NAVO-L4HR1m-GLOB-K10_SST is PODAAC-GHK10-41N01
ID for NCDC-L4LRblend-GLOB-AVHRR_AMSR_OI is PODAAC-GHAOI-4BC01
ID for SEAWINDS_LEVEL_3_V2 is PODAAC-SEAXX-L3X02
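If the goal is to look IDs up by name rather than print them, the same dicts can be folded into a mapping. This is a sketch only: `build_id_lookup` is a hypothetical helper, the first two sample entries are copied from the output above, and the third (incomplete) entry is invented to show how missing keys are skipped.

```python
def build_id_lookup(docs):
    """Map each dataset short name to its persistent ID, skipping incomplete entries."""
    lookup = {}
    for doc in docs:
        name = doc.get("Dataset-ShortName-Full")
        dataset_id = doc.get("Dataset-PersistentId")
        if name and dataset_id:
            lookup[name] = dataset_id
    return lookup


sample_docs = [
    {"Dataset-ShortName-Full": "NSCAT_LEVEL_3_V2",
     "Dataset-PersistentId": "PODAAC-NSJPL-L3X02"},
    {"Dataset-ShortName-Full": "UCLA_DEALIASED_SASS_L3",
     "Dataset-PersistentId": "PODAAC-SASSX-L3UCD"},
    {"Dataset-ShortName-Full": "INCOMPLETE_ENTRY"},  # invented: no ID present
]

ids = build_id_lookup(sample_docs)
print(ids["NSCAT_LEVEL_3_V2"])
```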

A second Ajax request returns more data:

import requests
from pprint import pprint as pp

json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/granule/select/?q=*&fq=Granule-AccessType:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+Granule-Status:ONLINE&facet=true&facet.field=Dataset-ShortName-Full&rows=0&facet.limit=-1&facet.mincount=1&wt=json").json()

pp(json)
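With `rows=0` and `facet=true`, a Solr-style response carries no documents and instead reports per-value counts for the faceted field. The sketch below assumes the server uses Solr's default flat list layout (`[value, count, value, count, ...]`) under `facet_counts` / `facet_fields`; that has not been checked against this endpoint. `facet_counts_for` is a hypothetical helper and the counts in the sample are invented.

```python
def facet_counts_for(data, field):
    """Turn a flat [value, count, value, count, ...] facet list into a dict."""
    flat = data.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    # Even positions hold the values, odd positions hold their counts.
    return dict(zip(flat[::2], flat[1::2]))


# Invented sample in the assumed layout; the dataset names appear earlier in
# this answer, the numbers do not come from the real service.
sample = {
    "response": {"numFound": 15, "docs": []},
    "facet_counts": {
        "facet_fields": {
            "Dataset-ShortName-Full": [
                "NSCAT_LEVEL_2_V2", 10,
                "NSCAT_LEVEL_3_V2", 5,
            ]
        }
    },
}

counts = facet_counts_for(sample, "Dataset-ShortName-Full")
print(counts)
```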

You can also change some of the parameters to get different output.
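One way to experiment with the parameters is to build the query string from a dict instead of editing the long URL by hand. This is a sketch under assumptions: `build_query_url` is a hypothetical helper, the defaults are illustrative, the `fq` filter from the original request is left out for brevity, and whether the service still accepts these parameters has not been verified.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the first request in this answer.
BASE_URL = "http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/"


def build_query_url(rows=10,
                    fields=("Dataset-PersistentId", "Dataset-ShortName-Full"),
                    query="*:*"):
    """Assemble a select URL with the given row limit and field list."""
    params = {
        "q": query,               # match-all query by default
        "fl": ",".join(fields),   # comma-separated list of fields to return
        "rows": rows,             # maximum number of documents
        "wt": "json",             # response format
    }
    return BASE_URL + "?" + urlencode(params)


url = build_query_url(rows=5)
print(url)

# Round-trip check: the encoded URL decodes back to the intended parameters.
print(parse_qs(urlsplit(url).query))
```

`urlencode` percent-encodes characters such as `*`, `:` and `,`, so the printed URL looks different from the hand-written one while decoding to the same parameter values.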
