How to get the complete source of a web page that has 3 parts

Posted 2024-10-04 09:26:17

I am trying to scrape this web page:

https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?xsl=main.xsl&CaseID=773013618

It has three sections. When I check View Source manually, I only get the data for the one section that has the cursor. With this code:

    from selenium import webdriver

    driver = webdriver.Ie()
    driver.get('https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?xsl=main.xsl&CaseID=773013618')
    content = driver.page_source

page_source also returns the data for only one section.

If I try to use

    driver.switch_to_frame(1)

I get an error saying that no such frame is available. The site uses JavaScript.

Any help?


2 answers

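As you have correctly observed, there are 3 divisions: a Top Window and 2 frames. So we can get the page source of the Top Window, then loop through the 2 frames and scrape the page source of each, as follows: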

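from selenium import webdriver
from selenium.webdriver.common.by import By

# NOTE: the driver path must match the browser class. webdriver.Ie expects
# IEDriverServer.exe, while chromedriver.exe belongs with webdriver.Chrome.
# The path below is kept as originally posted.
driver = webdriver.Ie(r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get(r'https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?xsl=main.xsl&CaseID=773013618')
# Page source of the top-level window
content = driver.page_source
print("Content on Top Window is :")
print(content)
# find_elements_by_xpath was removed in Selenium 4.3; this form works in 3 and 4
multiple_frames = driver.find_elements(By.XPATH, '//iframe')
print("There are " + str(len(multiple_frames)) + " frames")
for frame in multiple_frames:
    print("Content on " + frame.get_attribute("name") + " frame is : ")
    # Switch into the frame, read its source, then return to the top window
    driver.switch_to.frame(frame)
    sub_content = driver.page_source
    print(sub_content)
    driver.switch_to.default_content()
driver.quit()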

The output on the console is:

Content on Top Window is :
<html xmlns:saxon="http://saxon.sf.net/" xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dot="http://www.volpe.dot.gov" xml:lang="en"><head>
      <title>NASS Case Viewer - CaseID:773013618</title>
      <link id="StyleOut" type="text/css" rel="stylesheet" title="output" href="StyleOut.css" /><script src="main.js"></script></head>
   <body onload="javascript:init('True','/NASS/CDS/XSLT/','773013618','case.xsl','CaseForm','Crash')">
...
...
...
</body></html>
There are 2 frames
Content on menu frame is : 
<html xmlns:svg="http://www.w3.org/2000/svg" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:dot="http://www.volpe.dot.gov" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head>
      <meta http-equiv="Content-Script-Type" />
      <title>menu</title>
...
...
...
                </script></head></html>
Content on viewer frame is : 
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:svg="http://www.w3.org/2000/svg-20000303-stylable" xmlns:fn="http://www.w3.org/2005/02/xpath-functions"><head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      <title>Case</title>
      <link id="StyleOut" type="text/css" rel="stylesheet" title="output" href="StyleOut.css" />
   </head>
   <body id="bodyMain">
...
...
...
</body></html>
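
The loop above prints each source as it goes. If the sources need to be processed afterwards, the same steps can be wrapped in a helper that returns them in a dictionary. This is only a sketch: the function name collect_page_sources, the "top" key, and the frame_xpath parameter are hypothetical names chosen for illustration, and the frames are assumed to be reachable with the '//iframe' XPath used in the answer.

```python
def collect_page_sources(driver, frame_xpath="//iframe"):
    """Return the page source of the top window and of each frame.

    `driver` is assumed to be a Selenium WebDriver. The string "xpath" is
    the value of selenium.webdriver.common.by.By.XPATH, so no Selenium
    import is needed inside this helper.
    """
    sources = {"top": driver.page_source}
    frame_count = len(driver.find_elements("xpath", frame_xpath))
    for index in range(frame_count):
        # Locate the frames again on every pass: element references found
        # earlier can go stale once the driver has switched contexts.
        frame = driver.find_elements("xpath", frame_xpath)[index]
        name = frame.get_attribute("name") or "frame_{}".format(index)
        driver.switch_to.frame(frame)
        try:
            sources[name] = driver.page_source
        finally:
            # Always return to the top window, even if reading fails.
            driver.switch_to.default_content()
    return sources
```

With the driver from the answer, collect_page_sources(driver) would be expected to return one entry for the top window plus one per frame, keyed by the frame's name attribute.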

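Your page has two frames, and each has a name and an id, so you can switch to either frame by locating it. (find_element_by_name was removed in Selenium 4.3; the calls below use find_element with By, imported from selenium.webdriver.common.by.)

    driver.switch_to.frame(driver.find_element(By.NAME, 'menu'))

or

    driver.switch_to.frame(driver.find_element(By.NAME, 'viewer'))

Use driver.switch_to.default_content() to switch back to the top-level document.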
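
If the "no such frame" error from the question happens because the frames are created by JavaScript after the initial load (an assumption; the thread does not confirm the cause), retrying the switch for a few seconds may help. Selenium ships this behaviour as WebDriverWait combined with expected_conditions.frame_to_be_available_and_switch_to_it; the block below is a dependency-free sketch of the same idea, and switch_to_frame_when_ready, timeout, and poll are hypothetical names.

```python
import time


def switch_to_frame_when_ready(driver, frame_ref, timeout=10.0, poll=0.5):
    """Retry driver.switch_to.frame(frame_ref) until it works or times out.

    `frame_ref` may be anything switch_to.frame accepts: a frame name or
    id, an index, or a frame WebElement. Returns True on success and
    False if the frame never became available within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            driver.switch_to.frame(frame_ref)
            return True
        except Exception:
            # Selenium raises NoSuchFrameException here; real code would
            # catch that class from selenium.common.exceptions instead.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
```

A call such as switch_to_frame_when_ready(driver, 'viewer') would then replace the direct switch, followed by driver.switch_to.default_content() once the frame's source has been read.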

