Pretty self-explanatory. Like countless people before and after me, I get a "Blas GEMM launch failed" error message when trying to call model.fit().

Here is the output of nvidia-smi before calling model.compile():
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 45C P0 74W / 149W | 0MiB / 11441MiB | 100% Default | <<<--- 0% Memory usage
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found | <<<--- nothing running
+-----------------------------------------------------------------------------+
And here is the output of nvidia-smi after calling model.compile() but before model.fit():
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 45C P0 72W / 149W | 10942MiB / 11441MiB | 0% Default | <<<--- 96% Memory usage
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1811 C /usr/bin/python3 10929MiB | <<<--- TF model here
+-----------------------------------------------------------------------------+
It looks like the compiled TensorFlow model alone claims 96% of the GPU memory. I don't know whether this is normal, or whether it could be the cause of the error that shows up later when trying to train the model.

The error message itself looks like this:
tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
InternalError: Blas GEMM launch failed : a.shape=(32, 116032), b.shape=(116032, 256), m=32, n=256, k=116032 [[node dense_1/MatMul (defined at /home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_1645]
Function call stack: keras_scratch_graph
Output of tf.config.experimental.list_physical_devices():
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
The model is built with keras.models.Sequential. I have gone through countless GitHub issues, blog posts, and S.O. questions, making sure that no previously running process was still alive on the GPU when starting a new one, adding the CUPTI location to LD_LIBRARY_PATH, trying various TF options... none of which solved the problem. Any insight into what causes this and how to fix it would be greatly appreciated.
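One way to narrow this down is to check whether cuBLAS itself works outside of the Keras model. The sketch below (my own diagnostic, not from the original post; assumes TF 2.x eager mode) runs a single bare matrix multiplication on the GPU; if cuBLAS cannot initialize, for example because the GPU memory is already fully allocated, this tends to raise the same "Blas GEMM launch failed" InternalError.

```python
import tensorflow as tf

# Minimal reproduction attempt: one GEMM on the GPU, no Keras involved.
# With soft device placement (the default) this falls back to CPU when
# no GPU is available, so the snippet also runs on a CPU-only machine.
with tf.device('/GPU:0'):
    a = tf.random.normal((32, 128))
    b = tf.random.normal((128, 64))
    c = tf.matmul(a, b)

print(c.shape)  # (32, 64)
```

If this bare matmul fails with the same error, the problem is in the CUDA/cuBLAS setup or GPU memory state, not in the model code.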
I had the same problem. I went through many answers and tried many of the suggested code snippets, but nothing helped me.
For me the problem was GPU memory usage, so I limited the memory TensorFlow is allowed to allocate on the GPU with the code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth. That solved my problem; I hope it solves yours too.
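For reference, the memory-growth snippet from that linked TensorFlow guide looks roughly like this (a sketch of the TF 2.x API; by default TensorFlow grabs nearly all GPU memory up front, which matches the 96% usage seen in nvidia-smi above):

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of
# pre-allocating (almost) the whole card. Must run before any
# GPU has been initialized by an op.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized.
        print(e)
```

Place this at the very top of the script, before building or compiling the model.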