Why does my StyleGAN code running on a cluster stop producing output or logs after training on 3765.5k (out of 25000k) images?

Posted 2024-05-07 20:00:13


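I downloaded the StyleGAN code (https://github.com/NVlabs/stylegan) to my GPU cluster, which is managed by Torque and nvidia-docker. The only file I edited is config.py, to point it at the correct directories; all other settings are the defaults. However, when I submit the job to Torque with a correctly configured Docker image, it produces logs, snapshots, and other output correctly only for about the first 20 hours (up to the point where training has processed 3765.5k images, at a current resolution of 64*64), and in the 300 hours after that it never produces any further output. The snapshot at 3765.5k images is correct. The program runs on 8 NVIDIA K80 GPUs, each with 12 GB of memory, and there are no out-of-memory messages, errors, or warnings during training.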

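I created an 'abort.txt' file in run_dir, which should stop the training process once it is detected during a progress update, but the program has not stopped in the 4 hours since.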
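
The fact that abort.txt has no effect fits a cooperative stop check that only runs between units of work. The following is a minimal sketch of that pattern, not StyleGAN's actual code: the names `should_stop`, `train`, and `run_one_tick` are hypothetical, and it assumes the marker file is only looked for after a tick returns.

```python
import os


def should_stop(run_dir, abort_name="abort.txt"):
    """Return True if the abort marker file exists in run_dir."""
    return os.path.isfile(os.path.join(run_dir, abort_name))


def train(run_dir, total_ticks, run_one_tick):
    """Run ticks one by one and check for the abort marker after each tick.

    Returns the number of ticks that completed. `run_one_tick` is a
    hypothetical callable standing in for one tick of training.
    """
    completed = 0
    for tick in range(total_ticks):
        # If this call blocks forever, the check below is never reached,
        # so the abort marker is never noticed.
        run_one_tick(tick)
        completed += 1
        if should_stop(run_dir):
            break
    return completed
```

Under this assumption, an abort file that goes unnoticed for hours suggests the process is blocked inside a tick rather than looping normally.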

When I reduced the number of images trained at each resolution to 2k, it successfully produced 1024*1024 output.
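
To relate the image count to the resolution, here is a rough sketch of a progressive-growing schedule. It assumes 600k images of training plus 600k images of transition per resolution, starting at 8*8; these values are not quoted from config.py, but they reproduce the lod values in the log (for example lod 4.92 at kimg 3045.5). The function name `lod_schedule` and its parameters are illustrative.

```python
import math


def lod_schedule(kimg, final_resolution=1024, initial_resolution=8,
                 training_kimg=600.0, transition_kimg=600.0):
    """Return (lod, resolution) after `kimg` thousand images.

    Each phase is `training_kimg` at a fixed resolution followed by
    `transition_kimg` during which the next resolution is faded in.
    The default durations are assumptions inferred from the log.
    """
    phase_dur = training_kimg + transition_kimg
    phase_idx = int(kimg // phase_dur) if phase_dur > 0 else 0
    phase_kimg = kimg - phase_idx * phase_dur

    # lod counts how many doublings remain until the final resolution.
    lod = math.log2(final_resolution) - math.log2(initial_resolution) - phase_idx
    if transition_kimg > 0:
        lod -= max(phase_kimg - training_kimg, 0.0) / transition_kimg
    lod = max(lod, 0.0)

    resolution = 2 ** (int(math.log2(final_resolution)) - int(math.floor(lod)))
    return lod, resolution
```

Under these assumptions, `lod_schedule(3765.5)` gives lod 4.0 at 64*64, and the fade-in of 128*128 would not begin until 4200 kimg, so the point where output stops does not line up with a resolution change.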

I have run the code 4 times, and I also re-downloaded it from GitHub; the problem occurs every time.

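I also reduced the number of images per tick to 2k. The program then produced output successfully for 60 hours, but I had to stop it because of a time limit, so I do not know whether it would have made it past 3765.5k images.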

Here are the run_dir files

Here is log.txt:

dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Dataset shape = [3, 1024, 1024]
Dynamic range = [0, 255]
Label size    = 0
Constructing networks...

G                               Params    OutputShape          WeightShape     
---                             ---       ---                  ---             
latents_in                      -         (?, 512)             -               
labels_in                       -         (?, 0)               -               
lod                             -         ()                   -               
dlatent_avg                     -         (512,)               -               
G_mapping/latents_in            -         (?, 512)             -               
G_mapping/labels_in             -         (?, 0)               -               
G_mapping/PixelNorm             -         (?, 512)             -               
G_mapping/Dense0                262656    (?, 512)             (512, 512)      
G_mapping/Dense1                262656    (?, 512)             (512, 512)      
G_mapping/Dense2                262656    (?, 512)             (512, 512)      
G_mapping/Dense3                262656    (?, 512)             (512, 512)      
G_mapping/Dense4                262656    (?, 512)             (512, 512)      
G_mapping/Dense5                262656    (?, 512)             (512, 512)      
G_mapping/Dense6                262656    (?, 512)             (512, 512)      
G_mapping/Dense7                262656    (?, 512)             (512, 512)      
G_mapping/Broadcast             -         (?, 18, 512)         -               
G_mapping/dlatents_out          -         (?, 18, 512)         -               
Truncation                      -         (?, 18, 512)         -               
G_synthesis/dlatents_in         -         (?, 18, 512)         -               
G_synthesis/4x4/Const           534528    (?, 512, 4, 4)       (512,)          
G_synthesis/4x4/Conv            2885632   (?, 512, 4, 4)       (3, 3, 512, 512)
G_synthesis/ToRGB_lod8          1539      (?, 3, 4, 4)         (1, 1, 512, 3)  
G_synthesis/8x8/Conv0_up        2885632   (?, 512, 8, 8)       (3, 3, 512, 512)
G_synthesis/8x8/Conv1           2885632   (?, 512, 8, 8)       (3, 3, 512, 512)
G_synthesis/ToRGB_lod7          1539      (?, 3, 8, 8)         (1, 1, 512, 3)  
G_synthesis/Upscale2D           -         (?, 3, 8, 8)         -               
G_synthesis/Grow_lod7           -         (?, 3, 8, 8)         -               
G_synthesis/16x16/Conv0_up      2885632   (?, 512, 16, 16)     (3, 3, 512, 512)
G_synthesis/16x16/Conv1         2885632   (?, 512, 16, 16)     (3, 3, 512, 512)
G_synthesis/ToRGB_lod6          1539      (?, 3, 16, 16)       (1, 1, 512, 3)  
G_synthesis/Upscale2D_1         -         (?, 3, 16, 16)       -               
G_synthesis/Grow_lod6           -         (?, 3, 16, 16)       -               
G_synthesis/32x32/Conv0_up      2885632   (?, 512, 32, 32)     (3, 3, 512, 512)
G_synthesis/32x32/Conv1         2885632   (?, 512, 32, 32)     (3, 3, 512, 512)
G_synthesis/ToRGB_lod5          1539      (?, 3, 32, 32)       (1, 1, 512, 3)  
G_synthesis/Upscale2D_2         -         (?, 3, 32, 32)       -               
G_synthesis/Grow_lod5           -         (?, 3, 32, 32)       -               
G_synthesis/64x64/Conv0_up      1442816   (?, 256, 64, 64)     (3, 3, 512, 256)
G_synthesis/64x64/Conv1         852992    (?, 256, 64, 64)     (3, 3, 256, 256)
G_synthesis/ToRGB_lod4          771       (?, 3, 64, 64)       (1, 1, 256, 3)  
G_synthesis/Upscale2D_3         -         (?, 3, 64, 64)       -               
G_synthesis/Grow_lod4           -         (?, 3, 64, 64)       -               
G_synthesis/128x128/Conv0_up    426496    (?, 128, 128, 128)   (3, 3, 256, 128)
G_synthesis/128x128/Conv1       279040    (?, 128, 128, 128)   (3, 3, 128, 128)
G_synthesis/ToRGB_lod3          387       (?, 3, 128, 128)     (1, 1, 128, 3)  
G_synthesis/Upscale2D_4         -         (?, 3, 128, 128)     -               
G_synthesis/Grow_lod3           -         (?, 3, 128, 128)     -               
G_synthesis/256x256/Conv0_up    139520    (?, 64, 256, 256)    (3, 3, 128, 64) 
G_synthesis/256x256/Conv1       102656    (?, 64, 256, 256)    (3, 3, 64, 64)  
G_synthesis/ToRGB_lod2          195       (?, 3, 256, 256)     (1, 1, 64, 3)   
G_synthesis/Upscale2D_5         -         (?, 3, 256, 256)     -               
G_synthesis/Grow_lod2           -         (?, 3, 256, 256)     -               
G_synthesis/512x512/Conv0_up    51328     (?, 32, 512, 512)    (3, 3, 64, 32)  
G_synthesis/512x512/Conv1       42112     (?, 32, 512, 512)    (3, 3, 32, 32)  
G_synthesis/ToRGB_lod1          99        (?, 3, 512, 512)     (1, 1, 32, 3)   
G_synthesis/Upscale2D_6         -         (?, 3, 512, 512)     -               
G_synthesis/Grow_lod1           -         (?, 3, 512, 512)     -               
G_synthesis/1024x1024/Conv0_up  21056     (?, 16, 1024, 1024)  (3, 3, 32, 16)  
G_synthesis/1024x1024/Conv1     18752     (?, 16, 1024, 1024)  (3, 3, 16, 16)  
G_synthesis/ToRGB_lod0          51        (?, 3, 1024, 1024)   (1, 1, 16, 3)   
G_synthesis/Upscale2D_7         -         (?, 3, 1024, 1024)   -               
G_synthesis/Grow_lod0           -         (?, 3, 1024, 1024)   -               
G_synthesis/images_out          -         (?, 3, 1024, 1024)   -               
G_synthesis/lod                 -         ()                   -               
G_synthesis/noise0              -         (1, 1, 4, 4)         -               
G_synthesis/noise1              -         (1, 1, 4, 4)         -               
G_synthesis/noise2              -         (1, 1, 8, 8)         -               
G_synthesis/noise3              -         (1, 1, 8, 8)         -               
G_synthesis/noise4              -         (1, 1, 16, 16)       -               
G_synthesis/noise5              -         (1, 1, 16, 16)       -               
G_synthesis/noise6              -         (1, 1, 32, 32)       -               
G_synthesis/noise7              -         (1, 1, 32, 32)       -               
G_synthesis/noise8              -         (1, 1, 64, 64)       -               
G_synthesis/noise9              -         (1, 1, 64, 64)       -               
G_synthesis/noise10             -         (1, 1, 128, 128)     -               
G_synthesis/noise11             -         (1, 1, 128, 128)     -               
G_synthesis/noise12             -         (1, 1, 256, 256)     -               
G_synthesis/noise13             -         (1, 1, 256, 256)     -               
G_synthesis/noise14             -         (1, 1, 512, 512)     -               
G_synthesis/noise15             -         (1, 1, 512, 512)     -               
G_synthesis/noise16             -         (1, 1, 1024, 1024)   -               
G_synthesis/noise17             -         (1, 1, 1024, 1024)   -               
images_out                      -         (?, 3, 1024, 1024)   -               
---                             ---       ---                  ---             
Total                           26219627                                       


D                     Params    OutputShape          WeightShape     
---                   ---       ---                  ---             
images_in             -         (?, 3, 1024, 1024)   -               
labels_in             -         (?, 0)               -               
lod                   -         ()                   -               
FromRGB_lod0          64        (?, 16, 1024, 1024)  (1, 1, 3, 16)   
1024x1024/Conv0       2320      (?, 16, 1024, 1024)  (3, 3, 16, 16)  
1024x1024/Conv1_down  4640      (?, 32, 512, 512)    (3, 3, 16, 32)  
Downscale2D           -         (?, 3, 512, 512)     -               
FromRGB_lod1          128       (?, 32, 512, 512)    (1, 1, 3, 32)   
Grow_lod0             -         (?, 32, 512, 512)    -               
512x512/Conv0         9248      (?, 32, 512, 512)    (3, 3, 32, 32)  
512x512/Conv1_down    18496     (?, 64, 256, 256)    (3, 3, 32, 64)  
Downscale2D_1         -         (?, 3, 256, 256)     -               
FromRGB_lod2          256       (?, 64, 256, 256)    (1, 1, 3, 64)   
Grow_lod1             -         (?, 64, 256, 256)    -               
256x256/Conv0         36928     (?, 64, 256, 256)    (3, 3, 64, 64)  
256x256/Conv1_down    73856     (?, 128, 128, 128)   (3, 3, 64, 128) 
Downscale2D_2         -         (?, 3, 128, 128)     -               
FromRGB_lod3          512       (?, 128, 128, 128)   (1, 1, 3, 128)  
Grow_lod2             -         (?, 128, 128, 128)   -               
128x128/Conv0         147584    (?, 128, 128, 128)   (3, 3, 128, 128)
128x128/Conv1_down    295168    (?, 256, 64, 64)     (3, 3, 128, 256)
Downscale2D_3         -         (?, 3, 64, 64)       -               
FromRGB_lod4          1024      (?, 256, 64, 64)     (1, 1, 3, 256)  
Grow_lod3             -         (?, 256, 64, 64)     -               
64x64/Conv0           590080    (?, 256, 64, 64)     (3, 3, 256, 256)
64x64/Conv1_down      1180160   (?, 512, 32, 32)     (3, 3, 256, 512)
Downscale2D_4         -         (?, 3, 32, 32)       -               
FromRGB_lod5          2048      (?, 512, 32, 32)     (1, 1, 3, 512)  
Grow_lod4             -         (?, 512, 32, 32)     -               
32x32/Conv0           2359808   (?, 512, 32, 32)     (3, 3, 512, 512)
32x32/Conv1_down      2359808   (?, 512, 16, 16)     (3, 3, 512, 512)
Downscale2D_5         -         (?, 3, 16, 16)       -               
FromRGB_lod6          2048      (?, 512, 16, 16)     (1, 1, 3, 512)  
Grow_lod5             -         (?, 512, 16, 16)     -               
16x16/Conv0           2359808   (?, 512, 16, 16)     (3, 3, 512, 512)
16x16/Conv1_down      2359808   (?, 512, 8, 8)       (3, 3, 512, 512)
Downscale2D_6         -         (?, 3, 8, 8)         -               
FromRGB_lod7          2048      (?, 512, 8, 8)       (1, 1, 3, 512)  
Grow_lod6             -         (?, 512, 8, 8)       -               
8x8/Conv0             2359808   (?, 512, 8, 8)       (3, 3, 512, 512)
8x8/Conv1_down        2359808   (?, 512, 4, 4)       (3, 3, 512, 512)
Downscale2D_7         -         (?, 3, 4, 4)         -               
FromRGB_lod8          2048      (?, 512, 4, 4)       (1, 1, 3, 512)  
Grow_lod7             -         (?, 512, 4, 4)       -               
4x4/MinibatchStddev   -         (?, 513, 4, 4)       -               
4x4/Conv              2364416   (?, 512, 4, 4)       (3, 3, 513, 512)
4x4/Dense0            4194816   (?, 512)             (8192, 512)     
4x4/Dense1            513       (?, 1)               (512, 1)        
scores_out            -         (?, 1)               -               
---                   ---       ---                  ---             
Total                 23087249                                       

Building TensorFlow graph...
Setting up snapshot image grid...
Setting up run dir...
Training...

tick 1     kimg 140.3    lod 7.00  minibatch 256  time 16m 19s      sec/tick 418.8   sec/kimg 2.99    maintenance 560.3  gpumem 3.8 
network-snapshot-000140        time 4m 57s       fid50k 411.7487  
tick 2     kimg 280.6    lod 7.00  minibatch 256  time 30m 44s      sec/tick 348.9   sec/kimg 2.49    maintenance 515.8  gpumem 3.8 
tick 3     kimg 420.9    lod 7.00  minibatch 256  time 36m 34s      sec/tick 347.7   sec/kimg 2.48    maintenance 2.8    gpumem 3.8 
tick 4     kimg 561.2    lod 7.00  minibatch 256  time 42m 26s      sec/tick 348.7   sec/kimg 2.49    maintenance 3.3    gpumem 3.8 
tick 5     kimg 681.5    lod 6.87  minibatch 128  time 52m 00s      sec/tick 570.3   sec/kimg 4.74    maintenance 3.0    gpumem 4.2 
tick 6     kimg 801.8    lod 6.66  minibatch 128  time 1h 03m 16s   sec/tick 673.2   sec/kimg 5.60    maintenance 3.4    gpumem 4.2 
tick 7     kimg 922.1    lod 6.46  minibatch 128  time 1h 14m 37s   sec/tick 677.9   sec/kimg 5.63    maintenance 2.8    gpumem 4.2 
tick 8     kimg 1042.4   lod 6.26  minibatch 128  time 1h 25m 58s   sec/tick 677.5   sec/kimg 5.63    maintenance 3.1    gpumem 4.2 
tick 9     kimg 1162.8   lod 6.06  minibatch 128  time 1h 37m 22s   sec/tick 681.0   sec/kimg 5.66    maintenance 3.1    gpumem 4.2 
tick 10    kimg 1283.1   lod 6.00  minibatch 128  time 1h 48m 37s   sec/tick 672.1   sec/kimg 5.59    maintenance 2.8    gpumem 4.2 
network-snapshot-001283        time 5m 03s       fid50k 331.6563  
tick 11    kimg 1403.4   lod 6.00  minibatch 128  time 2h 05m 00s   sec/tick 673.5   sec/kimg 5.60    maintenance 310.3  gpumem 4.2 
tick 12    kimg 1523.7   lod 6.00  minibatch 128  time 2h 16m 11s   sec/tick 668.5   sec/kimg 5.56    maintenance 2.6    gpumem 4.2 
tick 13    kimg 1644.0   lod 6.00  minibatch 128  time 2h 27m 32s   sec/tick 677.8   sec/kimg 5.63    maintenance 2.9    gpumem 4.2 
tick 14    kimg 1764.4   lod 6.00  minibatch 128  time 2h 38m 44s   sec/tick 668.8   sec/kimg 5.56    maintenance 2.9    gpumem 4.2 
tick 15    kimg 1864.4   lod 5.89  minibatch 64   time 3h 01m 44s   sec/tick 1377.3  sec/kimg 13.76   maintenance 3.0    gpumem 4.2 
tick 16    kimg 1964.5   lod 5.73  minibatch 64   time 3h 32m 00s   sec/tick 1812.6  sec/kimg 18.11   maintenance 3.5    gpumem 4.2 
tick 17    kimg 2064.6   lod 5.56  minibatch 64   time 4h 02m 22s   sec/tick 1817.6  sec/kimg 18.16   maintenance 3.6    gpumem 4.2 
tick 18    kimg 2164.7   lod 5.39  minibatch 64   time 4h 32m 43s   sec/tick 1818.2  sec/kimg 18.16   maintenance 3.7    gpumem 4.2 
tick 19    kimg 2264.8   lod 5.23  minibatch 64   time 5h 03m 01s   sec/tick 1813.3  sec/kimg 18.12   maintenance 3.8    gpumem 4.2 
tick 20    kimg 2364.9   lod 5.06  minibatch 64   time 5h 33m 20s   sec/tick 1816.2  sec/kimg 18.14   maintenance 3.4    gpumem 4.2 
network-snapshot-002364        time 5m 35s       fid50k 239.0368  
tick 21    kimg 2465.0   lod 5.00  minibatch 64   time 6h 09m 15s   sec/tick 1813.4  sec/kimg 18.12   maintenance 341.7  gpumem 4.2 
tick 22    kimg 2565.1   lod 5.00  minibatch 64   time 6h 39m 31s   sec/tick 1813.0  sec/kimg 18.11   maintenance 3.2    gpumem 4.2 
tick 23    kimg 2665.2   lod 5.00  minibatch 64   time 7h 09m 47s   sec/tick 1812.4  sec/kimg 18.11   maintenance 3.0    gpumem 4.2 
tick 24    kimg 2765.3   lod 5.00  minibatch 64   time 7h 40m 05s   sec/tick 1814.9  sec/kimg 18.13   maintenance 3.3    gpumem 4.2 
tick 25    kimg 2865.4   lod 5.00  minibatch 64   time 8h 10m 20s   sec/tick 1812.0  sec/kimg 18.10   maintenance 3.3    gpumem 4.2 
tick 26    kimg 2965.5   lod 5.00  minibatch 64   time 8h 40m 30s   sec/tick 1805.8  sec/kimg 18.04   maintenance 3.5    gpumem 4.2 
tick 27    kimg 3045.5   lod 4.92  minibatch 32   time 9h 23m 10s   sec/tick 2557.2  sec/kimg 31.96   maintenance 3.1    gpumem 4.2 
tick 28    kimg 3125.5   lod 4.79  minibatch 32   time 10h 19m 34s  sec/tick 3380.5  sec/kimg 42.26   maintenance 3.8    gpumem 4.2 
tick 29    kimg 3205.5   lod 4.66  minibatch 32   time 11h 16m 06s  sec/tick 3388.1  sec/kimg 42.35   maintenance 3.8    gpumem 4.2 
tick 30    kimg 3285.5   lod 4.52  minibatch 32   time 12h 12m 28s  sec/tick 3378.0  sec/kimg 42.22   maintenance 4.0    gpumem 4.2 
network-snapshot-003285        time 6m 39s       fid50k 182.6515  
tick 31    kimg 3365.5   lod 4.39  minibatch 32   time 13h 15m 45s  sec/tick 3389.5  sec/kimg 42.37   maintenance 407.8  gpumem 4.2 
tick 32    kimg 3445.5   lod 4.26  minibatch 32   time 14h 12m 34s  sec/tick 3404.6  sec/kimg 42.56   maintenance 4.2    gpumem 4.2 
tick 33    kimg 3525.5   lod 4.12  minibatch 32   time 15h 09m 36s  sec/tick 3417.8  sec/kimg 42.72   maintenance 3.9    gpumem 4.2 
tick 34    kimg 3605.5   lod 4.00  minibatch 32   time 16h 06m 26s  sec/tick 3406.1  sec/kimg 42.58   maintenance 4.1    gpumem 4.2 
tick 35    kimg 3685.5   lod 4.00  minibatch 32   time 17h 03m 16s  sec/tick 3406.0  sec/kimg 42.58   maintenance 3.9    gpumem 4.2 
tick 36    kimg 3765.5   lod 4.00  minibatch 32   time 18h 00m 06s  sec/tick 3406.1  sec/kimg 42.58   maintenance 3.7    gpumem 4.2 
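
The tick lines above carry enough information to estimate when the next line should have appeared. The following is a small sketch that parses them; the regular expression is written against the format shown in this log, the `d` (days) unit is an assumption for longer runs, and the helper names are illustrative.

```python
import re

# Matches the "tick ..." lines in the format shown in the log above.
TICK_RE = re.compile(
    r"^tick\s+(?P<tick>\d+)\s+kimg\s+(?P<kimg>[\d.]+)\s+lod\s+(?P<lod>[\d.]+)\s+"
    r"minibatch\s+(?P<minibatch>\d+)\s+time\s+(?P<time>.+?)\s+"
    r"sec/tick\s+(?P<sec_tick>[\d.]+)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_elapsed(text):
    """Convert an elapsed-time string such as '18h 00m 06s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def parse_ticks(log_text):
    """Return one dict per tick line found in the log text."""
    rows = []
    for line in log_text.splitlines():
        m = TICK_RE.match(line.strip())
        if m:
            rows.append({
                "tick": int(m.group("tick")),
                "kimg": float(m.group("kimg")),
                "elapsed": parse_elapsed(m.group("time")),
                "sec_tick": float(m.group("sec_tick")),
            })
    return rows


def expected_next_tick_seconds(rows):
    """Estimate the elapsed time of the next tick, ignoring maintenance time."""
    last = rows[-1]
    return last["elapsed"] + last["sec_tick"]
```

Applied to the last logged line (tick 36 at 18h 00m 06s, 3406.1 sec/tick), this gives about 68212 seconds, roughly 18h 57m, so at a steady rate a tick 37 line would be expected about an hour after tick 36.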

Here is the .pbs file for Torque:

  #PBS    -N  stylegan
  #PBS    -o  /ghome/fengrl/home/LIA/stylegan/log/out/$PBS_JOBID.out
  #PBS    -e  /ghome/fengrl/home/LIA/stylegan/log/err/$PBS_JOBID.err
  #PBS    -l nodes=1:gpus=8:E
  #PBS    -r y
  #PBS    -q mcc
  cd $PBS_O_WORKDIR
  echo Time is `date`
  echo Directory is $PWD
  echo This job runs on following nodes:
  echo -n "Node:"
  cat $PBS_NODEFILE
  echo -n "Gpus:"
  cat $PBS_GPUFILE
  echo "CUDA_VISIBLE_DEVICES:"$CUDA_VISIBLE_DEVICES
  startdocker -u "-v /gpub:/fengrl" -P /ghome/fengrl/home/LIA/stylegan -D /gdata/fengrl/test -c "python /ghome/fengrl/home/LIA/stylegan/train.py"  bit:5000/cxs-py36-tf112-torch041

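If everything worked correctly, it should keep producing all output (snapshot images at resolutions up to 1024*1024, and training log entries up to the 1024*1024 resolution) until training on 25000k images is complete, but it actually only produces output up to 3765.5k images.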

