Python: Unable to download a file from a web page with Selenium

Posted 2024-06-28 11:12:26


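My goal is to download a zip file from https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa. It is a link on this page: https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa. I then want to save it to the directory "/home/vinvin/shKLSE/" (I am using PythonAnywhere), unzip it, and extract the CSV file into that directory.

The code runs to the end without errors, but nothing is downloaded. When I download the file manually, ...

My code is below, with a working username and password. Real credentials are used so that the problem is easier to understand.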

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

    zip_ref = zipfile.ZipFile("/home/vinvin/sh/KLSE", 'r')
    zip_ref.extractall("/home/vinvin/sh/KLSE")
    zip_ref.close()
    os.remove(zip_ref)

HTML snippet: (not preserved in this copy)

Note that `&amp;amp` shows up when I copy the snippet. It is hidden from view-source, so I guess it is generated by JavaScript.

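Observations I made:

  1. Even though the code runs without errors, the directory home/vinvin/shKLSE is never created.

  2. I tried downloading a much smaller zip file, which should complete within a second, but it still was not downloaded after waiting 30 seconds:

         dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()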



3 Answers

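I don't see any major flaw in your code block as such. Still, here are some suggestions based on working through this solution and running the automated script:

  1. This code runs perfectly during off-market hours. During market hours a lot of JavaScript and Ajax calls come into play, and handling those is beyond the scope of this question.
  2. Check whether the intended download directory exists first, and create it if it does not. The code block for this is written Windows-style and works on Windows.
  3. After clicking "Login", introduce some wait so that the HTML DOM can render properly.
  4. For the download to start without prompts, you need to set a few more preferences in FirefoxProfile, as shown in my code below.
  5. Always consider maximizing the browser window with browser.maximize_window().
  6. Once the download has started, you need to wait long enough for the file to download completely.
  7. If you call browser.quit() at the end, you do not need browser.close().
  8. Consider replacing all the time.sleep() calls with ImplicitlyWait, ExplicitWait or FluentWait.
  9. Here is your own code block with some simple tweaks: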

    #!/usr/bin/python
    print "hello from python 2"
    
    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    
    
    display = Display(visible=0, size=(800, 600))
    display.start()
    
    newpath = 'C:\\home\\vivvin\\shKLSE'
    if not os.path.exists(newpath):
        os.makedirs(newpath)    
    
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.download.dir", newpath)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip")
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.openFile", "application/zip")
    profile.set_preference("browser.helperApps.alwaysAsk.force", False)
    profile.set_preference("browser.download.manager.useWindow", False)
    profile.set_preference("browser.download.manager.focusWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.openFile", "")
    profile.set_preference("browser.download.manager.alertOnEXEOpen", False)
    profile.set_preference("browser.download.manager.showAlertOnComplete", False)
    profile.set_preference("browser.download.manager.closeWhenDone", True)
    profile.set_preference("pdfjs.disabled", True)
    
    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)
    
    browser.maximize_window()
    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    time.sleep(10)
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(900)
    
    browser.close()
    browser.quit()
    display.stop()
    
    zip_ref = zipfile.ZipFile("/home/vinvin/sh/KLSE", 'r')
    zip_ref.extractall("/home/vinvin/sh/KLSE")
    zip_ref.close()
    os.remove(zip_ref)
    

Let me know if this answers your question.
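
Suggestion 2 above (create the download directory if it is missing) can be written without a Windows-style path. The following is a minimal sketch and not part of the original answer: `ensure_dir` is a hypothetical helper name, and the `shKLSE` folder under a temporary directory only stands in for the real download directory.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (including parents) if it is missing; return its absolute form."""
    path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


if __name__ == "__main__":
    # Demo in a throwaway location; in the script this would be the
    # directory passed to the 'browser.download.dir' preference.
    demo_root = tempfile.mkdtemp()
    download_dir = ensure_dir(os.path.join(demo_root, "shKLSE"))
    print(download_dir)
```

Using `os.path.join` keeps the same code usable on Linux hosts such as PythonAnywhere as well as on Windows.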

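I rewrote your script, with comments explaining why I made each change. I think your main problem was probably a bad mimetype; however, your script also had a litany of systemic issues that would have made it unreliable at best. This rewrite uses explicit waits, which completely removes the need for time.sleep(), lets the script run as fast as possible, and eliminates errors caused by network congestion.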

You will need to run the following to make sure all the modules are installed:

    pip install requests explicit selenium retry pyvirtualdisplay

The script:

    #!/usr/bin/python

    from __future__ import print_function  # Makes your code portable

    import os
    import glob
    import zipfile
    from contextlib import contextmanager

    import requests
    from retry import retry
    from explicit import waiter, XPATH, ID
    from selenium import webdriver
    from pyvirtualdisplay import Display
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.wait import WebDriverWait

    DOWNLOAD_DIR = "/tmp/shKLSE/"


    def build_profile():
        profile = webdriver.FirefoxProfile()
        profile.set_preference('browser.download.folderList', 2)
        profile.set_preference('browser.download.manager.showWhenStarting', False)
        profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
        # I think your `/zip` mime type was incorrect. This works for me
        profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                               'application/vnd.ms-excel,application/zip')

        return profile


    # Retry is an elegant way to retry the browser creation
    # Though you should narrow the scope to whatever the actual exception is you are
    # retrying on
    @retry(Exception, tries=5, delay=3)
    @contextmanager  # This turns get_browser into a context manager
    def get_browser():
        # Use a context manager with Display, so it will be closed even if an
        # exception is thrown
        profile = build_profile()
        with Display(visible=0, size=(800, 600)):
            browser = webdriver.Firefox(profile)
            print("firefox")
            try:
                yield browser
            finally:
                # Let a try/finally block manage closing the browser, even if an
                # exception is raised
                browser.quit()


    def main():
        print("hello from python 2")
        with get_browser() as browser:
            browser.get("https://www.shareinvestor.com/my")

            # Click the login button
            # waiter is a helper function that makes it easy to use explicit waits
            # with it you don't need to use time.sleep() calls at all
            login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
            waiter.find_element(browser, login_xpath, XPATH).click()
            print(browser.current_url)

            # Log in
            username = "bkcollection"
            username_id = "sic_login_header_username"
            password = "123456"
            password_id = "sic_login_header_password"
            waiter.find_write(browser, username_id, username, by=ID)
            waiter.find_write(browser, password_id, password, by=ID, send_enter=True)

            # Wait for login process to finish by locating an element only found
            # after logging in, like the Logged In Nav
            nav_id = 'sic_loggedInNav'
            waiter.find_element(browser, nav_id, ID)

            print("log in done")

            # Load the target page
            target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
                          "type=price_download_all_stocks_bursa")
            browser.get(target_url)
            print(browser.current_url)

            # Click download button
            all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
                              "type=history_all&market=bursa']")
            waiter.find_element(browser, all_data_xpath, XPATH).click()

            # This is a bit challenging: You need to wait until the download is complete
            # This file is 220 MB, it takes a while to complete. This method waits until
            # there is at least one file in the dir, then waits until there are no
            # filenames that end in `.part`
            # Note that this is problematic if there is already a file in the target dir. I
            # suggest looking into using the tempfile module to create a unique, temporary
            # directory for downloading every time you run your script
            print("Waiting for download to complete")
            at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
            WebDriverWait(glob.glob, 300).until(at_least_1)

            no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
            WebDriverWait(glob.glob, 300).until(no_parts)

            print("Download Done")

            # Now do whatever it is you need to do with the zip file
            # zip_ref = zipfile.ZipFile(DOWNLOAD_DIR, 'r')
            # zip_ref.extractall(DOWNLOAD_DIR)
            # zip_ref.close()
            # os.remove(zip_ref)

            print("Done!")


    if __name__ == "__main__":
        main()
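
The download wait and the commented-out unzip step in the script can also be written with the standard library alone. This is a rough sketch under stated assumptions, not the answer's own method: `wait_for_download` and `extract_and_remove` are hypothetical helper names, and the check assumes that Firefox keeps a `.part` file next to the target while the download is still in progress, as the script's comments describe.

```python
import glob
import os
import time
import zipfile


def wait_for_download(directory, timeout=300, poll=1.0):
    """Block until `directory` holds at least one .zip and no .part file.

    Returns the sorted list of .zip paths; raises RuntimeError on timeout.
    """
    deadline = time.time() + timeout
    while True:
        zips = sorted(glob.glob(os.path.join(directory, "*.zip")))
        parts = glob.glob(os.path.join(directory, "*.part"))
        if zips and not parts:
            return zips
        if time.time() >= deadline:
            raise RuntimeError("download did not finish in %s s" % timeout)
        time.sleep(poll)


def extract_and_remove(zip_path, dest_dir):
    """Extract one archive into `dest_dir`, delete it, return member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    os.remove(zip_path)  # os.remove needs the file path, not the ZipFile object
    return names
```

Creating the download directory with `tempfile.mkdtemp()` on each run would avoid the stale-file problem mentioned in the script's comments, because the directory then starts out empty.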

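Full disclosure: I maintain the explicit module. It is designed to make explicit waits much easier to use in exactly this kind of situation, where a website slowly loads dynamic content in response to user interaction. You could replace all of the waiter.XXX calls above with direct explicit waits.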
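
A direct explicit wait is essentially a polling loop with a deadline. The sketch below illustrates that idea with the standard library only, so it runs without a browser. `wait_until` and `WaitTimeout` are hypothetical names; in a real script Selenium's own `WebDriverWait(driver, timeout).until(condition)` plays this role.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy in time."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` every `poll` seconds until it returns a truthy value.

    Returns that value, or raises WaitTimeout after `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() >= deadline:
            raise WaitTimeout("condition not met within %s s" % timeout)
        time.sleep(poll)


# Hypothetical use in place of time.sleep(10) after submitting the login form
# (find_elements_by_id returns an empty, falsy list while the element is absent):
#   wait_until(lambda: browser.find_elements_by_id("sic_loggedInNav"), timeout=30)
```

Unlike a fixed sleep, the loop returns as soon as the condition holds and fails loudly when it never does.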

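Take it outside the scope of Selenium. Change the preference settings so that clicking the link (first check that the link is valid) brings up a popup asking whether to save the file, then use Sikuli (http://www.sikuli.org/) to click through the popup. MIME types do not always work, and there is no black-and-white answer as to why they fail.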
