I downloaded the StyleGAN code (https://github.com/NVlabs/stylegan) to my GPU cluster, which is managed by Torque and nvidia-docker. The only file I edited is config.py, to point at the correct directories; every other setting is at its default. However, when I submit the job to Torque with the correctly configured Docker image, it produces logs, snapshots, and other outputs correctly only for the first 20 hours (up to kimg 3765.5, at the current resolution of 64×64); over the following 300 hours it produced no output at all. The snapshot at kimg 3765.5 is correct. The job runs on 8 NVIDIA K80 GPUs with 12 GB memory each, and there were no out-of-memory messages, errors, or warnings during training.
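For reference, my edits to config.py were only of the following form (the paths shown here are placeholders, not my actual directories; the variable names are the ones in the stock config.py):

```python
# config.py -- the only file I edited; everything else is left at defaults.
# NOTE: the paths below are placeholders, not my actual directories.
result_dir = '/path/to/results'    # where run dirs, logs and snapshots are written
data_dir   = '/path/to/datasets'   # location of the prepared TFRecord datasets
cache_dir  = '/path/to/cache'
run_dir_ignore = ['results', 'datasets', 'cache']
```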
I created an 'abort.txt' in the run_dir, which should stop the training process as soon as it is detected, but the program did not stop within the following 4 hours.
I reduced the number of training images per resolution to 2k, and the run successfully produced 1024×1024 output.
I ran the code 4 times, and also re-downloaded it from GitHub; the problem occurred every time.
I also reduced the images per tick to 2k; that run produced output successfully for 60 hours, at which point I had to stop it due to the cluster's time limit, so I don't know whether it would also have stalled after kimg 3765.5.
Here is the log.txt:
dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Dataset shape = [3, 1024, 1024]
Dynamic range = [0, 255]
Label size = 0
Constructing networks...
G Params OutputShape WeightShape
--- --- --- ---
latents_in - (?, 512) -
labels_in - (?, 0) -
lod - () -
dlatent_avg - (512,) -
G_mapping/latents_in - (?, 512) -
G_mapping/labels_in - (?, 0) -
G_mapping/PixelNorm - (?, 512) -
G_mapping/Dense0 262656 (?, 512) (512, 512)
G_mapping/Dense1 262656 (?, 512) (512, 512)
G_mapping/Dense2 262656 (?, 512) (512, 512)
G_mapping/Dense3 262656 (?, 512) (512, 512)
G_mapping/Dense4 262656 (?, 512) (512, 512)
G_mapping/Dense5 262656 (?, 512) (512, 512)
G_mapping/Dense6 262656 (?, 512) (512, 512)
G_mapping/Dense7 262656 (?, 512) (512, 512)
G_mapping/Broadcast - (?, 18, 512) -
G_mapping/dlatents_out - (?, 18, 512) -
Truncation - (?, 18, 512) -
G_synthesis/dlatents_in - (?, 18, 512) -
G_synthesis/4x4/Const 534528 (?, 512, 4, 4) (512,)
G_synthesis/4x4/Conv 2885632 (?, 512, 4, 4) (3, 3, 512, 512)
G_synthesis/ToRGB_lod8 1539 (?, 3, 4, 4) (1, 1, 512, 3)
G_synthesis/8x8/Conv0_up 2885632 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/8x8/Conv1 2885632 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/ToRGB_lod7 1539 (?, 3, 8, 8) (1, 1, 512, 3)
G_synthesis/Upscale2D - (?, 3, 8, 8) -
G_synthesis/Grow_lod7 - (?, 3, 8, 8) -
G_synthesis/16x16/Conv0_up 2885632 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/16x16/Conv1 2885632 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/ToRGB_lod6 1539 (?, 3, 16, 16) (1, 1, 512, 3)
G_synthesis/Upscale2D_1 - (?, 3, 16, 16) -
G_synthesis/Grow_lod6 - (?, 3, 16, 16) -
G_synthesis/32x32/Conv0_up 2885632 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/32x32/Conv1 2885632 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/ToRGB_lod5 1539 (?, 3, 32, 32) (1, 1, 512, 3)
G_synthesis/Upscale2D_2 - (?, 3, 32, 32) -
G_synthesis/Grow_lod5 - (?, 3, 32, 32) -
G_synthesis/64x64/Conv0_up 1442816 (?, 256, 64, 64) (3, 3, 512, 256)
G_synthesis/64x64/Conv1 852992 (?, 256, 64, 64) (3, 3, 256, 256)
G_synthesis/ToRGB_lod4 771 (?, 3, 64, 64) (1, 1, 256, 3)
G_synthesis/Upscale2D_3 - (?, 3, 64, 64) -
G_synthesis/Grow_lod4 - (?, 3, 64, 64) -
G_synthesis/128x128/Conv0_up 426496 (?, 128, 128, 128) (3, 3, 256, 128)
G_synthesis/128x128/Conv1 279040 (?, 128, 128, 128) (3, 3, 128, 128)
G_synthesis/ToRGB_lod3 387 (?, 3, 128, 128) (1, 1, 128, 3)
G_synthesis/Upscale2D_4 - (?, 3, 128, 128) -
G_synthesis/Grow_lod3 - (?, 3, 128, 128) -
G_synthesis/256x256/Conv0_up 139520 (?, 64, 256, 256) (3, 3, 128, 64)
G_synthesis/256x256/Conv1 102656 (?, 64, 256, 256) (3, 3, 64, 64)
G_synthesis/ToRGB_lod2 195 (?, 3, 256, 256) (1, 1, 64, 3)
G_synthesis/Upscale2D_5 - (?, 3, 256, 256) -
G_synthesis/Grow_lod2 - (?, 3, 256, 256) -
G_synthesis/512x512/Conv0_up 51328 (?, 32, 512, 512) (3, 3, 64, 32)
G_synthesis/512x512/Conv1 42112 (?, 32, 512, 512) (3, 3, 32, 32)
G_synthesis/ToRGB_lod1 99 (?, 3, 512, 512) (1, 1, 32, 3)
G_synthesis/Upscale2D_6 - (?, 3, 512, 512) -
G_synthesis/Grow_lod1 - (?, 3, 512, 512) -
G_synthesis/1024x1024/Conv0_up 21056 (?, 16, 1024, 1024) (3, 3, 32, 16)
G_synthesis/1024x1024/Conv1 18752 (?, 16, 1024, 1024) (3, 3, 16, 16)
G_synthesis/ToRGB_lod0 51 (?, 3, 1024, 1024) (1, 1, 16, 3)
G_synthesis/Upscale2D_7 - (?, 3, 1024, 1024) -
G_synthesis/Grow_lod0 - (?, 3, 1024, 1024) -
G_synthesis/images_out - (?, 3, 1024, 1024) -
G_synthesis/lod - () -
G_synthesis/noise0 - (1, 1, 4, 4) -
G_synthesis/noise1 - (1, 1, 4, 4) -
G_synthesis/noise2 - (1, 1, 8, 8) -
G_synthesis/noise3 - (1, 1, 8, 8) -
G_synthesis/noise4 - (1, 1, 16, 16) -
G_synthesis/noise5 - (1, 1, 16, 16) -
G_synthesis/noise6 - (1, 1, 32, 32) -
G_synthesis/noise7 - (1, 1, 32, 32) -
G_synthesis/noise8 - (1, 1, 64, 64) -
G_synthesis/noise9 - (1, 1, 64, 64) -
G_synthesis/noise10 - (1, 1, 128, 128) -
G_synthesis/noise11 - (1, 1, 128, 128) -
G_synthesis/noise12 - (1, 1, 256, 256) -
G_synthesis/noise13 - (1, 1, 256, 256) -
G_synthesis/noise14 - (1, 1, 512, 512) -
G_synthesis/noise15 - (1, 1, 512, 512) -
G_synthesis/noise16 - (1, 1, 1024, 1024) -
G_synthesis/noise17 - (1, 1, 1024, 1024) -
images_out - (?, 3, 1024, 1024) -
--- --- --- ---
Total 26219627
D Params OutputShape WeightShape
--- --- --- ---
images_in - (?, 3, 1024, 1024) -
labels_in - (?, 0) -
lod - () -
FromRGB_lod0 64 (?, 16, 1024, 1024) (1, 1, 3, 16)
1024x1024/Conv0 2320 (?, 16, 1024, 1024) (3, 3, 16, 16)
1024x1024/Conv1_down 4640 (?, 32, 512, 512) (3, 3, 16, 32)
Downscale2D - (?, 3, 512, 512) -
FromRGB_lod1 128 (?, 32, 512, 512) (1, 1, 3, 32)
Grow_lod0 - (?, 32, 512, 512) -
512x512/Conv0 9248 (?, 32, 512, 512) (3, 3, 32, 32)
512x512/Conv1_down 18496 (?, 64, 256, 256) (3, 3, 32, 64)
Downscale2D_1 - (?, 3, 256, 256) -
FromRGB_lod2 256 (?, 64, 256, 256) (1, 1, 3, 64)
Grow_lod1 - (?, 64, 256, 256) -
256x256/Conv0 36928 (?, 64, 256, 256) (3, 3, 64, 64)
256x256/Conv1_down 73856 (?, 128, 128, 128) (3, 3, 64, 128)
Downscale2D_2 - (?, 3, 128, 128) -
FromRGB_lod3 512 (?, 128, 128, 128) (1, 1, 3, 128)
Grow_lod2 - (?, 128, 128, 128) -
128x128/Conv0 147584 (?, 128, 128, 128) (3, 3, 128, 128)
128x128/Conv1_down 295168 (?, 256, 64, 64) (3, 3, 128, 256)
Downscale2D_3 - (?, 3, 64, 64) -
FromRGB_lod4 1024 (?, 256, 64, 64) (1, 1, 3, 256)
Grow_lod3 - (?, 256, 64, 64) -
64x64/Conv0 590080 (?, 256, 64, 64) (3, 3, 256, 256)
64x64/Conv1_down 1180160 (?, 512, 32, 32) (3, 3, 256, 512)
Downscale2D_4 - (?, 3, 32, 32) -
FromRGB_lod5 2048 (?, 512, 32, 32) (1, 1, 3, 512)
Grow_lod4 - (?, 512, 32, 32) -
32x32/Conv0 2359808 (?, 512, 32, 32) (3, 3, 512, 512)
32x32/Conv1_down 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
Downscale2D_5 - (?, 3, 16, 16) -
FromRGB_lod6 2048 (?, 512, 16, 16) (1, 1, 3, 512)
Grow_lod5 - (?, 512, 16, 16) -
16x16/Conv0 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
16x16/Conv1_down 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
Downscale2D_6 - (?, 3, 8, 8) -
FromRGB_lod7 2048 (?, 512, 8, 8) (1, 1, 3, 512)
Grow_lod6 - (?, 512, 8, 8) -
8x8/Conv0 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
8x8/Conv1_down 2359808 (?, 512, 4, 4) (3, 3, 512, 512)
Downscale2D_7 - (?, 3, 4, 4) -
FromRGB_lod8 2048 (?, 512, 4, 4) (1, 1, 3, 512)
Grow_lod7 - (?, 512, 4, 4) -
4x4/MinibatchStddev - (?, 513, 4, 4) -
4x4/Conv 2364416 (?, 512, 4, 4) (3, 3, 513, 512)
4x4/Dense0 4194816 (?, 512) (8192, 512)
4x4/Dense1 513 (?, 1) (512, 1)
scores_out - (?, 1) -
--- --- --- ---
Total 23087249
Building TensorFlow graph...
Setting up snapshot image grid...
Setting up run dir...
Training...
tick 1 kimg 140.3 lod 7.00 minibatch 256 time 16m 19s sec/tick 418.8 sec/kimg 2.99 maintenance 560.3 gpumem 3.8
network-snapshot-000140 time 4m 57s fid50k 411.7487
tick 2 kimg 280.6 lod 7.00 minibatch 256 time 30m 44s sec/tick 348.9 sec/kimg 2.49 maintenance 515.8 gpumem 3.8
tick 3 kimg 420.9 lod 7.00 minibatch 256 time 36m 34s sec/tick 347.7 sec/kimg 2.48 maintenance 2.8 gpumem 3.8
tick 4 kimg 561.2 lod 7.00 minibatch 256 time 42m 26s sec/tick 348.7 sec/kimg 2.49 maintenance 3.3 gpumem 3.8
tick 5 kimg 681.5 lod 6.87 minibatch 128 time 52m 00s sec/tick 570.3 sec/kimg 4.74 maintenance 3.0 gpumem 4.2
tick 6 kimg 801.8 lod 6.66 minibatch 128 time 1h 03m 16s sec/tick 673.2 sec/kimg 5.60 maintenance 3.4 gpumem 4.2
tick 7 kimg 922.1 lod 6.46 minibatch 128 time 1h 14m 37s sec/tick 677.9 sec/kimg 5.63 maintenance 2.8 gpumem 4.2
tick 8 kimg 1042.4 lod 6.26 minibatch 128 time 1h 25m 58s sec/tick 677.5 sec/kimg 5.63 maintenance 3.1 gpumem 4.2
tick 9 kimg 1162.8 lod 6.06 minibatch 128 time 1h 37m 22s sec/tick 681.0 sec/kimg 5.66 maintenance 3.1 gpumem 4.2
tick 10 kimg 1283.1 lod 6.00 minibatch 128 time 1h 48m 37s sec/tick 672.1 sec/kimg 5.59 maintenance 2.8 gpumem 4.2
network-snapshot-001283 time 5m 03s fid50k 331.6563
tick 11 kimg 1403.4 lod 6.00 minibatch 128 time 2h 05m 00s sec/tick 673.5 sec/kimg 5.60 maintenance 310.3 gpumem 4.2
tick 12 kimg 1523.7 lod 6.00 minibatch 128 time 2h 16m 11s sec/tick 668.5 sec/kimg 5.56 maintenance 2.6 gpumem 4.2
tick 13 kimg 1644.0 lod 6.00 minibatch 128 time 2h 27m 32s sec/tick 677.8 sec/kimg 5.63 maintenance 2.9 gpumem 4.2
tick 14 kimg 1764.4 lod 6.00 minibatch 128 time 2h 38m 44s sec/tick 668.8 sec/kimg 5.56 maintenance 2.9 gpumem 4.2
tick 15 kimg 1864.4 lod 5.89 minibatch 64 time 3h 01m 44s sec/tick 1377.3 sec/kimg 13.76 maintenance 3.0 gpumem 4.2
tick 16 kimg 1964.5 lod 5.73 minibatch 64 time 3h 32m 00s sec/tick 1812.6 sec/kimg 18.11 maintenance 3.5 gpumem 4.2
tick 17 kimg 2064.6 lod 5.56 minibatch 64 time 4h 02m 22s sec/tick 1817.6 sec/kimg 18.16 maintenance 3.6 gpumem 4.2
tick 18 kimg 2164.7 lod 5.39 minibatch 64 time 4h 32m 43s sec/tick 1818.2 sec/kimg 18.16 maintenance 3.7 gpumem 4.2
tick 19 kimg 2264.8 lod 5.23 minibatch 64 time 5h 03m 01s sec/tick 1813.3 sec/kimg 18.12 maintenance 3.8 gpumem 4.2
tick 20 kimg 2364.9 lod 5.06 minibatch 64 time 5h 33m 20s sec/tick 1816.2 sec/kimg 18.14 maintenance 3.4 gpumem 4.2
network-snapshot-002364 time 5m 35s fid50k 239.0368
tick 21 kimg 2465.0 lod 5.00 minibatch 64 time 6h 09m 15s sec/tick 1813.4 sec/kimg 18.12 maintenance 341.7 gpumem 4.2
tick 22 kimg 2565.1 lod 5.00 minibatch 64 time 6h 39m 31s sec/tick 1813.0 sec/kimg 18.11 maintenance 3.2 gpumem 4.2
tick 23 kimg 2665.2 lod 5.00 minibatch 64 time 7h 09m 47s sec/tick 1812.4 sec/kimg 18.11 maintenance 3.0 gpumem 4.2
tick 24 kimg 2765.3 lod 5.00 minibatch 64 time 7h 40m 05s sec/tick 1814.9 sec/kimg 18.13 maintenance 3.3 gpumem 4.2
tick 25 kimg 2865.4 lod 5.00 minibatch 64 time 8h 10m 20s sec/tick 1812.0 sec/kimg 18.10 maintenance 3.3 gpumem 4.2
tick 26 kimg 2965.5 lod 5.00 minibatch 64 time 8h 40m 30s sec/tick 1805.8 sec/kimg 18.04 maintenance 3.5 gpumem 4.2
tick 27 kimg 3045.5 lod 4.92 minibatch 32 time 9h 23m 10s sec/tick 2557.2 sec/kimg 31.96 maintenance 3.1 gpumem 4.2
tick 28 kimg 3125.5 lod 4.79 minibatch 32 time 10h 19m 34s sec/tick 3380.5 sec/kimg 42.26 maintenance 3.8 gpumem 4.2
tick 29 kimg 3205.5 lod 4.66 minibatch 32 time 11h 16m 06s sec/tick 3388.1 sec/kimg 42.35 maintenance 3.8 gpumem 4.2
tick 30 kimg 3285.5 lod 4.52 minibatch 32 time 12h 12m 28s sec/tick 3378.0 sec/kimg 42.22 maintenance 4.0 gpumem 4.2
network-snapshot-003285 time 6m 39s fid50k 182.6515
tick 31 kimg 3365.5 lod 4.39 minibatch 32 time 13h 15m 45s sec/tick 3389.5 sec/kimg 42.37 maintenance 407.8 gpumem 4.2
tick 32 kimg 3445.5 lod 4.26 minibatch 32 time 14h 12m 34s sec/tick 3404.6 sec/kimg 42.56 maintenance 4.2 gpumem 4.2
tick 33 kimg 3525.5 lod 4.12 minibatch 32 time 15h 09m 36s sec/tick 3417.8 sec/kimg 42.72 maintenance 3.9 gpumem 4.2
tick 34 kimg 3605.5 lod 4.00 minibatch 32 time 16h 06m 26s sec/tick 3406.1 sec/kimg 42.58 maintenance 4.1 gpumem 4.2
tick 35 kimg 3685.5 lod 4.00 minibatch 32 time 17h 03m 16s sec/tick 3406.0 sec/kimg 42.58 maintenance 3.9 gpumem 4.2
tick 36 kimg 3765.5 lod 4.00 minibatch 32 time 18h 00m 06s sec/tick 3406.1 sec/kimg 42.58 maintenance 3.7 gpumem 4.2
Here is the .pbs file for Torque:
#PBS -N stylegan
#PBS -o /ghome/fengrl/home/LIA/stylegan/log/out/$PBS_JOBID.out
#PBS -e /ghome/fengrl/home/LIA/stylegan/log/err/$PBS_JOBID.err
#PBS -l nodes=1:gpus=8:E
#PBS -r y
#PBS -q mcc
cd $PBS_O_WORKDIR
echo Time is `date`
echo Directory is $PWD
echo This job runs on following nodes:
echo -n "Node:"
cat $PBS_NODEFILE
echo -n "Gpus:"
cat $PBS_GPUFILE
echo "CUDA_VISIBLE_DEVICES:"$CUDA_VISIBLE_DEVICES
startdocker -u "-v /gpub:/fengrl" -P /ghome/fengrl/home/LIA/stylegan -D /gdata/fengrl/test -c "python /ghome/fengrl/home/LIA/stylegan/train.py" bit:5000/cxs-py36-tf112-torch041
If everything were working, it should keep producing all outputs (snapshot images up to 1024×1024 resolution, and tick log lines for training at 1024×1024) until training completes at 25000 kimg, but it actually stopped producing any output at kimg 3765.5.
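As a rough sanity check (my own arithmetic, using only the figures printed in the log above), the silence cannot be explained by a merely slow tick:

```python
# At the slowest rate observed before the hang (tick 36, lod 4.00, minibatch 32):
sec_per_kimg = 42.58   # from the last log line
kimg_per_tick = 80.0   # kimg advanced per tick at this stage of the schedule
tick_hours = sec_per_kimg * kimg_per_tick / 3600
print(tick_hours)  # ~0.95 hours per tick
# Even if the next lod transition doubled the cost again, one tick would
# still finish in under two hours, so 300 hours without a single new tick
# line means the process is stuck rather than slow.
```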