<p>我重写了你的剧本,并用评论解释了我为什么要做出改变。我认为你的主要问题可能是一个糟糕的mimetype,然而,你的脚本有一个系统性问题的日志,最多也会使它不可靠。此重写使用显式等待,这完全消除了使用<code>time.sleep()</code>的需要,使其尽可能快地运行,同时也消除了网络拥塞引起的错误。在</p>
<p>您需要执行以下操作以确保安装了所有模块:</p>
<p><code>pip install requests explicit selenium retry pyvirtualdisplay</code></p>
<p>剧本:</p>
<pre><code>#!/usr/bin/python
from __future__ import print_function # Makes your code portable
import os
import glob
import zipfile
from contextlib import contextmanager
import requests
from retry import retry
from explicit import waiter, XPATH, ID
from selenium import webdriver
from pyvirtualdisplay import Display
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
DOWNLOAD_DIR = "/tmp/shKLSE/"
def build_profile():
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
# I think your `/zip` mime type was incorrect. This works for me
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
'application/vnd.ms-excel,application/zip')
return profile
# Retry is an elegant way to retry the browser creation
# Though you should narrow the scope to whatever the actual exception is you are
# retrying on
@retry(Exception, tries=5, delay=3)
@contextmanager # This turns get_browser into a context manager
def get_browser():
# Use a context manager with Display, so it will be closed even if an
# exception is thrown
profile = build_profile()
with Display(visible=0, size=(800, 600)):
browser = webdriver.Firefox(profile)
print("firefox")
try:
yield browser
finally:
# Let a try/finally block manage closing the browser, even if an
# exception is called
browser.quit()
def main():
print("hello from python 2")
with get_browser() as browser:
browser.get("https://www.shareinvestor.com/my")
# Click the login button
# waiter is a helper function that makes it easy to use explicit waits
# with it you dont need to use time.sleep() calls at all
login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
waiter.find_element(browser, login_xpath, XPATH).click()
print(browser.current_url)
# Log in
username = "bkcollection"
username_id = "sic_login_header_username"
password = "123456"
password_id = "sic_login_header_password"
waiter.find_write(browser, username_id, username, by=ID)
waiter.find_write(browser, password_id, password, by=ID, send_enter=True)
# Wait for login process to finish by locating an element only found
# after logging in, like the Logged In Nav
nav_id = 'sic_loggedInNav'
waiter.find_element(browser, nav_id, ID)
print("log in done")
# Load the target page
target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
"type=price_download_all_stocks_bursa")
browser.get(target_url)
print(browser.current_url)
# CLick download button
all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
"type=history_all&market=bursa']")
waiter.find_element(browser, all_data_xpath, XPATH).click()
# This is a bit challenging: You need to wait until the download is complete
# This file is 220 MB, it takes a while to complete. This method waits until
# there is at least one file in the dir, then waits until there are no
# filenames that end in `.part`
# Note that is is problematic if there is already a file in the target dir. I
# suggest looking into using the tempdir module to create a unique, temporary
# directory for downloading every time you run your script
print("Waiting for download to complete")
at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
WebDriverWait(glob.glob, 300).until(at_least_1)
no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
WebDriverWait(glob.glob, 300).until(no_parts)
print("Download Done")
# Now do whatever it is you need to do with the zip file
# zip_ref = zipfile.ZipFile(DOWNLOAD_DIR, 'r')
# zip_ref.extractall(DOWNLOAD_DIR)
# zip_ref.close()
# os.remove(zip_ref)
print("Done!")
if __name__ == "__main__":
main()
</code></pre>
<p>完全公开:我维护显式模块。它的目的是使使用显式等待变得更容易,就像这样的情况下,网站缓慢地加载基于用户交互的动态内容。您可以将上面所有的<code>waiter.XXX</code>调用替换为直接显式等待。在</p>