Cannot run TensorFlow models from multiple threads

Posted 2024-09-30 02:34:15

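I am working on a project that needs to load and run different neural networks at the same time.

The models I use to test my code are taken from the DeepLab Demo. I basically wrapped their code in a class (called DeepLabModel) and instantiate it once for each of the different models they provide.

So far I have written a version of the code in which the models are loaded (and used) sequentially from the same process, and everything works fine.

Because I do some processing on the results and need to simulate a distributed environment, I need to parallelize each class that contains a model.

My first version was a class Agent that received a previously loaded DeepLabModel instance as a parameter. Each agent had a "predict" function that was executed in a separate process, but I noticed that the agents hung in the session.run() call (inside DeepLabModel.run()) without producing any output.

Since I could not find the cause, I rewrote the code so that each Agent now only receives the file name of its model. I wrote a function run_agent() that loads the model, waits on a queue for an input image, and runs the model on the input it receives.

Here is some code from the latest version (only the relevant parts; DeepLabModel is just a wrapper around the code from the demo mentioned above):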

from multiprocessing import Process, Queue

import DeepLabModel  # wrapper around the DeepLab demo code

class Master():
    def __init__(self, agents, timeout=10):
        self.agents = agents # Reference to agents
        # Message queues to communicate with the agents.
        # The master's output queue is the agent's input queue, and vice versa.
        self.output_queues = [Queue() for a in agents]
        self.input_queues = [Queue() for a in agents]
        # Processes simulating remote agents
        self.agentpool = [Process(target=a.run_agent,
                                  args=(self.output_queues[a_id], self.input_queues[a_id]))
                          for a_id, a in enumerate(self.agents)]
        for a in self.agentpool:
            a.start()
        print("Agents spawned")



class Agent():
    def __init__(self, agentname, model_name):
        self.agentname=agentname
        self.model_name = model_name
        self.model = None

    def load_model(self):
        # ... basically the same code contained in the DeepLab notebook
        # ... here I use self.model_name and download the model ...
        self.model = DeepLabModel.DeepLabModel(download_path)


    def run_agent(self, inqueue, outqueue):
        self.inqueue = inqueue
        self.outqueue = outqueue
        self.load_model()
        # The first element we expect is the task
        image = self.inqueue.get()
        result = self.model.run(image) # Here the program hangs/crashes
        # Answer back
        self.outqueue.put(result)
        # ...

The problem is that when the processes try to execute tf.run(), each of them fails with the error: tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

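However, no other warning is printed. I tried running this on both GPU and CPU. I am running a Docker image (nvidia-docker) with TensorFlow 1.12, and everything works as long as I do not try to run TensorFlow from different processes.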
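This cuDNN message is often reported when several processes each try to reserve most of the GPU memory, although nothing in this post confirms that this is the cause here. The following is a minimal sketch, assuming TensorFlow 1.x and assuming the DeepLabModel wrapper can be changed to build its session through a helper. The name make_session and its parameters are hypothetical and not part of the original code.

```python
def make_session(graph=None, memory_fraction=None):
    """Create a TF 1.x session that does not grab all GPU memory up front.

    Hypothetical helper: call it inside each worker process, after the
    process has started, so every agent owns its own session.
    """
    # Imported lazily so TensorFlow is initialized in the worker itself.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all at once.
    config.gpu_options.allow_growth = True
    if memory_fraction is not None:
        # Optional hard cap on the share of GPU memory for this process.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return tf.Session(graph=graph, config=config)
```

With two agents sharing one GPU, each could call something like make_session(graph, memory_fraction=0.4) when loading its model. The value 0.4 is only an illustration.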

I would also like to know why the code hangs if I load the models first and then pass them to the processes (one per process).
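One plausible explanation, not confirmed anywhere in this thread, is that on Linux multiprocessing forks the parent by default, and a TensorFlow session created before the fork relies on threads and device state that the child does not inherit in a usable form. A sketch of the usual workaround is to use the "spawn" start method and to build the model inside the child. FakeModel below is a hypothetical stand-in for DeepLabModel so the example runs without TensorFlow, and run_demo is likewise an illustrative name.

```python
import multiprocessing as mp


class FakeModel:
    """Hypothetical stand-in for DeepLabModel (no TensorFlow needed)."""

    def __init__(self, model_name):
        self.model_name = model_name

    def run(self, image):
        return "{} processed {}".format(self.model_name, image)


def run_agent(model_name, inqueue, outqueue):
    # Build the model inside the worker, so no session crosses a
    # process boundary.
    model = FakeModel(model_name)
    while True:
        image = inqueue.get()
        if image is None:  # sentinel value: stop the worker
            break
        outqueue.put(model.run(image))


def run_demo(names=("model_a", "model_b")):
    # "spawn" starts a fresh interpreter instead of forking the parent.
    ctx = mp.get_context("spawn")
    to_agents = [ctx.Queue() for _ in names]
    from_agents = [ctx.Queue() for _ in names]
    procs = [
        ctx.Process(target=run_agent, args=(name, q_in, q_out))
        for name, q_in, q_out in zip(names, to_agents, from_agents)
    ]
    for p in procs:
        p.start()
    for q in to_agents:
        q.put("image_0")  # one task per agent
        q.put(None)       # then ask the agent to stop
    results = [q.get(timeout=30) for q in from_agents]
    for p in procs:
        p.join()
    return results
```

Because "spawn" re-imports the main module in every child, run_demo() should be called from under an `if __name__ == "__main__":` guard in a script file. Only picklable arguments such as the model name and the queues are passed to the child; the model object itself never is.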

Thanks in advance.

0 answers

No answers yet