Getting the visible content of a page with Selenium and BeautifulSoup

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))

3 answers

User

#1 · edited 2024-06-26 14:10:38

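To complement the other answers, here is what I ended up doing. Working out whether a page, or one part of a page, has finished loading is a hard problem. I tried both implicit and explicit waits, but I still ended up with half-loaded frames. My workaround was to check the readyState of the original document and the readyState of each iframe.

Here is an example function: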

from time import monotonic, sleep

def _check_if_load_complete(driver, timeout=10):
    # Poll document.readyState until it is 'complete' or `timeout` seconds pass.
    deadline = monotonic() + timeout
    while monotonic() < deadline:
        if driver.execute_script('return document.readyState') == 'complete':
            return True
        sleep(0.1)
    return False

I then called this function after switching the driver's focus to the iframe.


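User

#2 · edited 2024-06-26 14:10:38

It sounds like the DOM elements have not finished loading when your code tries to access them.

Try waiting until the elements are fully loaded, and only then do the replacement.

Running the commands one at a time works because it gives the driver time to load every element before the next command executes.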
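As a rough illustration of "wait, then replace", here is a hand-rolled polling helper. It is a sketch, not Selenium's own `WebDriverWait`: the function name `wait_for_element` and its `timeout` and `poll` parameters are assumptions, and `driver` is expected to be a Selenium WebDriver. The string `'id'` is the locator strategy behind Selenium's `By.ID`.

```python
from time import monotonic, sleep


def wait_for_element(driver, element_id, timeout=10.0, poll=0.25):
    """Poll until an element with the given id exists.

    Returns the first matching element, or None if `timeout` seconds
    pass without a match. Hypothetical helper for illustration only.
    """
    deadline = monotonic() + timeout
    while True:
        # find_elements returns an empty list instead of raising when
        # nothing matches, which keeps the polling loop simple.
        matches = driver.find_elements('id', element_id)
        if matches:
            return matches[0]
        if monotonic() >= deadline:
            return None
        sleep(poll)
```

With the code from the question, this could run before the frame switch, for example `iframe = wait_for_element(driver, 'dsq-app1')` followed by `driver.switch_to.frame(iframe)` once the result is not `None`.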

User

#3 · edited 2024-06-26 14:10:38

Try fetching the page source only after the ID, CSS selector, class, or link you need has been detected.

You can always use Selenium WebDriver's explicit waits.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
# Wait up to 10 seconds for the element whose id is idName to appear in the DOM.
# idName is a placeholder: supply the id of the element you need.
f = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, idName)))
dom = BeautifulSoup(driver.page_source, parser)

f = dom.find('iframe', id='dsq-app1')
# switch_to_frame() was removed in Selenium 4; switch_to.frame() replaces it.
driver.switch_to.frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))

# dom.encode() returns bytes, so the file must be opened in binary mode.
with open('out.html', 'wb') as fe:
    fe.write(dom.encode('utf-8'))

Correct me if this doesn't work.
