Selenium multiprocessing help in Python 3.4

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool


def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv = list_of_inputs[2]
        directory = list_of_inputs[3]
        os.chdir(directory)

        if search == broad:
            specification = "broad"
            relPapers = newsabbv
        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv
        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv
        else:
            for newspapers in relPapers:
                ...rest of code here that gets the data and puts it in a list named all_Data...
                browser.close()
                df = pd.DataFrame(all_Data)
                df.to_csv(filename, index=False)
    except:
        print('error with item')


if __name__ == '__main__':
    ...Initializing values and things like that go here. This helps with the setup for search...

    #These are things that go into the function
    start = ["January",2015]
    end = ["August",2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]
    list_of_inputs = [start,end,newsabbv,directory]

    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
    print(list_of_inputs)

1 answer

User

#1 · Posted on 2024-10-02 10:33:16

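I don't think your approach is really multiprocessing, because Selenium still works through its tasks as a queue (not the queue module).

The reason is that a single Selenium WebDriver instance can drive only one window at a time; it cannot operate on several browser windows or tabs simultaneously (a limitation of how window handles work). That means your multiprocessing only parallelizes the in-memory handling of the data that is sent to Selenium or crawled by Selenium. Driving the whole Selenium crawl from one script file makes Selenium the bottleneck.

The best way to get real multiprocessing is:

Create a script that uses Selenium to process a single URL and save it as a file, for example crawler.py. Make sure the script has a print command that prints the result.

For example: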

import sys
# also import every module that you need to run Selenium

url = sys.argv[1]  # the URL is passed in on the command line

driver = ......  # open the browser

driver.get(url)
# continue the script according to your own method

print(result)  # the result that you want
sys.exit(0)

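I can't explain this part in more detail, because it is the core of the whole process and only you know what you want to do on that website.

Create another script file:

a. Divide the URLs. Multiprocessing means creating several processes and running them together across all CPU cores. The best way to set that up is to start by dividing the input, which in your case is probably the target URLs (you did not tell us which website you want to crawl). Every page of a website has its own URL, so just collect all the URLs and split them into several groups (best practice: your number of CPU cores minus 1).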

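The grouping in step (a) could be sketched roughly as follows. This is only an illustration, assuming the URLs have already been collected into a list; `split_into_groups` and the example.com URLs are hypothetical names that do not come from the original answer.

```python
import multiprocessing as mp


def split_into_groups(urls, n_groups):
    """Split urls into at most n_groups lists of nearly equal size."""
    # Never create more groups than there are URLs, and always at least one.
    n_groups = max(1, min(n_groups, len(urls)))
    size, extra = divmod(len(urls), n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups each take one additional URL.
        end = start + size + (1 if i < extra else 0)
        groups.append(urls[start:end])
        start = end
    return groups


# Leave one core free for the main script, but never go below one worker.
cpucore = max(1, mp.cpu_count() - 1)

# Placeholder URLs, for illustration only.
urls = ["https://example.com/page/%d" % i for i in range(10)]
groups = split_into_groups(urls, cpucore)
```

Contiguous slices keep the original URL order inside each group, which makes it easier to match results back to their URLs later.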

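b. Run the crawler.py you created earlier (with subprocess or another module, e.g. os.system). Make sure that no more than cpucore copies of crawler.py run at the same time.

For example: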

import multiprocessing as mp
import subprocess
import sys

crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl(urls, n):
    # split the list of URLs into n groups of roughly equal size
    return [urls[i::n] for i in range(n)]

def target(group):
    # crawl one group of URLs, one crawler.py process per URL
    results = []
    for url in group:
        # the command will be: python crawler.py "value"
        # the "value" is captured by sys.argv[1] in crawler.py
        t1 = subprocess.Popen([sys.executable, crawler, url],
                              stdout=subprocess.PIPE,
                              universal_newlines=True)
        out, _ = t1.communicate()  # wait for the crawler and read what it printed
        results.append(out)
        # continue the script, based on your needs...
    return results

if __name__ == '__main__':
    urls = []  # fill in all the URLs you want to crawl
    cpucore = max(1, mp.cpu_count() - 1)
    pool = mp.Pool(processes=cpucore)  # max is the value of cpucore
    # one group per worker; add more groups if you have more CPU cores
    results = pool.map(target, devideurl(urls, cpucore))
    pool.close()
    pool.join()

c. Store the printed results in the main script's memory.

Continue your script to process the data you have collected.
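Steps (b) and (c) together could be sketched as follows. This is a minimal sketch, not the answerer's exact method: it assumes crawler.py prints its result to stdout and exits with a non-zero status on failure, and the helper names `run_crawler` and `crawl_all` are hypothetical. It also swaps `multiprocessing.Pool` for a thread pool, which should be enough here because every crawl already runs in its own process and the threads only wait for output.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_crawler(crawler_path, url):
    """Run one crawler process for url and return what it printed."""
    try:
        output = subprocess.check_output(
            [sys.executable, crawler_path, url],
            universal_newlines=True,  # decode stdout to str
        )
    except subprocess.CalledProcessError:
        # The crawler exited with a non-zero status; record the failure.
        return None
    return output.strip()


def crawl_all(crawler_path, urls, max_workers):
    """Crawl every URL, running at most max_workers crawlers at once.

    Returns a dict that maps each URL to its printed result, or to None
    when the crawler failed for that URL.
    """
    urls = list(urls)  # the URLs are iterated twice below
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        outputs = executor.map(lambda u: run_crawler(crawler_path, u), urls)
        return dict(zip(urls, outputs))
```

Calling `crawl_all(crawler, urls, cpucore)` from the main script would then leave every printed result in memory, keyed by URL, ready for further processing.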

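Finally, write the multiprocessing code for the whole process in the main script.

With this method:

You can open several browser windows and work on them at the same time. Because fetching data from a website is slower than processing data in memory, this method at least reduces the bottleneck in the data flow, which means it is faster than your previous approach.

Hope this helps... cheers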
