TensorFlow raises an exception when using the GPU

Published 2024-10-04 09:25:34


I am trying to speed up a model I built with keras. After some trouble with cuda library versions I managed to get tensorflow to detect my GPU, but now, when I run the model with the GPU detected, it fails with the following traceback:

2021-01-20 17:40:26.549946: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ****___*********____________________________________________________________________________________
Traceback (most recent call last):
  File "model.py", line 72, in <module>
    history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=2, validation_data=(x_val, y_val))
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;ccc21c10a2feabe0;/job:localhost/replica:0/task:0/device:GPU:0;edge_17_IteratorGetNext;0:0
	 [[{{node IteratorGetNext/_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_875]
Function call stack:
train_function

The model runs fine on the CPU alone.
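For completeness, this is how I force a CPU-only run for comparison. Hiding the GPU via an environment variable is the standard CUDA mechanism, not anything specific to my setup:

```python
import os

# Hide all CUDA devices from TensorFlow; this must be set
# before `import tensorflow` is executed.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```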

I am not sure whether this is a versioning issue, but I will describe the situation in detail. I am running Gentoo, but since compiling the tensorflow package is a lot of work, I installed a binary package via pip, version 2.4.0. I installed the latest nvidia-cuda-toolkit and cudnn packages through the distribution's package manager, but when I then tested whether tensorflow detects my GPU, it said it could not find libcusolver.so.10, while the latest release gives me libcusolver.so.11. I tried downgrading to a version of the cuda toolkit that ships libcusolver.so.10, but then tensorflow complained that several other version-11 libraries were missing, so I reinstalled the latest cuda toolkit package and additionally placed the old libcusolver.so.10 file in the /opt/cuda/lib64 directory. I know this is a hacky solution, but I don't know what else I can do if that is the file it is looking for.
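For reference, this is the kind of check I run to see whether TensorFlow detects the GPU, plus a memory-growth setting that is sometimes suggested for allocator failures (enabling it here is a guess at a mitigation on my part, not a confirmed fix for this error):

```python
import tensorflow as tf

# List the GPUs TensorFlow can see (an empty list means no GPU was detected).
gpus = tf.config.list_physical_devices('GPU')
print(gpus)

# Ask TensorFlow to allocate GPU memory on demand instead of reserving
# nearly all of it up front; must be called before any GPU op runs.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```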

Here is my complete model code, using keras:

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(8, (7, 7), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(16, (7, 7), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

batch_size = 1000
epochs = 100
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                    verbose=2, validation_data=(x_val, y_val))
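Since the failure is an allocation error on IteratorGetNext, I wonder if the raw size of one input batch matters; a back-of-envelope estimate (the input shape below is hypothetical, since I haven't shown the actual input_shape):

```python
import numpy as np

def batch_bytes(batch_size, input_shape, dtype=np.float32):
    # Bytes needed for the raw input batch alone
    # (activations and gradients come on top of this).
    per_sample = int(np.prod(input_shape)) * np.dtype(dtype).itemsize
    return batch_size * per_sample

# Hypothetical 256x256 RGB input: ~0.79 GB just for one batch of 1000 samples.
print(batch_bytes(1000, (256, 256, 3)))  # 786432000
```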
