TensorFlow internal error: libdevice not found at ./libdevice.10.bc

Posted 2024-09-30 20:33:01


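I am trying to run a TensorFlow-based project on a cluster. I installed all the relevant dependencies in an Anaconda environment, exactly as I did on my local machine, where the project runs fine, but on the cluster I get the following error message: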

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: libdevice not found at ./libdevice.10.bc
         [[{{node cluster_2_1/xla_compile}}]]
         [[cluster_1_1/merge_oidx_20/_1]]
  (1) Internal: libdevice not found at ./libdevice.10.bc
         [[{{node cluster_2_1/xla_compile}}]]

Full traceback: https://pastebin.com/njqNFWvC

Inside /u/usr/anaconda3/envs/Project_BM/lib/ I can see the libdevice.10.bc file in question, yet the log contains these warnings:

2021-06-30 08:27:50.484735: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:69] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
2021-06-30 08:27:50.484775: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:70] Searched for CUDA in the following directories:
2021-06-30 08:27:50.484781: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73]   ./cuda_sdk_lib
2021-06-30 08:27:50.484784: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73]   /usr/local/cuda
2021-06-30 08:27:50.484787: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73]   .
2021-06-30 08:27:50.484791: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:75] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work. 

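This part of the log makes me think TensorFlow is searching for CUDA in system-wide locations rather than in the conda environment. To fix this, do I need to set XLA_FLAGS to point at /u/usr/anaconda3/envs/Project_BM/lib/libdevice.10.bc? If not, where can I find the /cuda/ directory inside the Project_BM environment?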
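To check where libdevice actually lives inside the environment, something like the following could help. It is only a sketch: `find_libdevice` is a hypothetical helper name, and it assumes the file name starts with `libdevice` and ends in `.bc`, as in the error message.

```python
from pathlib import Path


def find_libdevice(prefix):
    """Return every libdevice*.bc file found anywhere under `prefix`, sorted."""
    return sorted(Path(prefix).rglob("libdevice*.bc"))


# Example (run with the conda environment activated, so that sys.prefix
# is the environment root):
#   import sys
#   for path in find_libdevice(sys.prefix):
#       print(path)
```

The warning says XLA expects the file under `${CUDA_DIR}/nvvm/libdevice`, so a copy sitting directly in `lib/` would presumably not be picked up as-is.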

It is also worth knowing that I am running this on a cluster, so I do not have root privileges.
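Going by the warning text alone, one thing that might work without root is to build the `nvvm/libdevice` layout in a directory I own and point `xla_gpu_cuda_data_dir` at it. Below is a minimal sketch of that idea, not a verified fix for this cluster. The helper names `prepare_cuda_data_dir` and `set_xla_cuda_data_dir` and the `~/cuda_data` directory are hypothetical; only the `XLA_FLAGS` variable, the `--xla_gpu_cuda_data_dir` flag, and the `nvvm/libdevice` subdirectory come from the log above.

```python
import os
from pathlib import Path


def prepare_cuda_data_dir(libdevice_file, data_dir):
    """Create data_dir/nvvm/libdevice/ and symlink libdevice_file into it."""
    libdevice_file = Path(libdevice_file).resolve()
    target_dir = Path(data_dir) / "nvvm" / "libdevice"
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / libdevice_file.name
    # Leave an existing file or link alone so the call can be repeated safely.
    if not (link.is_symlink() or link.exists()):
        link.symlink_to(libdevice_file)
    return Path(data_dir).resolve()


def set_xla_cuda_data_dir(data_dir):
    """Append --xla_gpu_cuda_data_dir to XLA_FLAGS, keeping any existing flags."""
    flag = f"--xla_gpu_cuda_data_dir={data_dir}"
    existing = os.environ.get("XLA_FLAGS", "")
    os.environ["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return os.environ["XLA_FLAGS"]


# Example (hypothetical target directory; run this before importing tensorflow):
#   data_dir = prepare_cuda_data_dir(
#       "/u/usr/anaconda3/envs/Project_BM/lib/libdevice.10.bc",
#       Path.home() / "cuda_data",
#   )
#   set_xla_cuda_data_dir(data_dir)
#   import tensorflow as tf
```

The environment variable is set before the TensorFlow import as a precaution, since I am not sure at which point XLA reads it. Exporting `XLA_FLAGS` in the job script before launching Python should have the same effect.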

