Why does the model stop training after the first epoch without any warning, even though I specified 100 epochs?

Posted 2024-07-01 07:27:29


I am trying to train a RetinaNet model on Google Colab with GPU support, but after starting epoch 1 it races through the 1000 steps without actually training and then stops without any warning.

This is the terminal output I get after running the train command:

!keras_retinanet/bin/train.py --tensorboard-dir /content/TrainingOutput --snapshot-path /content/TrainingOutput/snapshots --random-transform --steps 1000 pascal /content/PlumsVOC



Creating model, this may take a second...
2021-08-19 03:38:20.717241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.725782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.726450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.727359: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-19 03:38:20.727598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.728167: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:20.728749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.263376: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.264133: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.264721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-19 03:38:21.265247: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-19 03:38:21.265304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13839 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py:356: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  "The `lr` argument is deprecated, use `learning_rate` instead.")
Model: "retinanet"

__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2021-08-19 03:38:24.467332: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-19 03:38:24.467379: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-08-19 03:38:24.467435: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-08-19 03:38:24.588819: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-19 03:38:24.589029: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py:1972: UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
  warnings.warn('`Model.fit_generator` is deprecated and '
/usr/local/lib/python3.7/dist-packages/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
  category=CustomMaskWarning)
2021-08-19 03:38:25.187697: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/50
2021-08-19 03:38:32.881842: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8004
   1/1000 [..............................] - ETA: 3:31:30 - loss: 3.8681 - regression_loss: 2.7375 - classification_loss: 1.13062021-08-19 03:38:38.104179: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-19 03:38:38.104232: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
   2/1000 [..............................] - ETA: 17:05 - loss: 3.8988 - regression_loss: 2.7693 - classification_loss: 1.1295  2021-08-19 03:38:38.938537: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2021-08-19 03:38:38.940902: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2021-08-19 03:38:39.134281: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673]  GpuTracer has collected 3251 callback api events and 3247 activity events. 
2021-08-19 03:38:39.192167: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-19 03:38:39.289977: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39

2021-08-19 03:38:39.355897: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.trace.json.gz
2021-08-19 03:38:39.455150: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39

2021-08-19 03:38:39.462678: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.memory_profile.json.gz
2021-08-19 03:38:39.466401: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39
Dumped tool data for xplane.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.xplane.pb
Dumped tool data for overview_page.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.overview_page.pb
Dumped tool data for input_pipeline.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /content/TrainingOutput/train/plugins/profile/2021_08_19_03_38_39/7409a69dd529.kernel_stats.pb

  11/1000 [..............................] - ETA: 6:57 - loss: 3.9632 - regression_loss: 2.8365 - classification_loss: 1.1267WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 50000 batches). You may need to use the repeat() function when building your dataset.
1000/1000 [==============================] - 17s 4ms/step - loss: 3.9632 - regression_loss: 2.8365 - classification_loss: 1.1267
Running network: 100% (4 of 4) |##########| Elapsed Time: 0:00:02 Time:  0:00:02
Parsing annotations: 100% (4 of 4) |######| Elapsed Time: 0:00:00 Time:  0:00:00
32 instances of class redPlum with average precision: 0.0000
0 instances of class greenPlum with average precision: 0.0000
mAP: 0.0000

Epoch 00001: saving model to /content/TrainingOutput/snapshots/resnet50_pascal_01.h5
/usr/local/lib/python3.7/dist-packages/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
  category=CustomMaskWarning)

It saves the model weights, but the saved model does not detect any objects in the test images. What is happening here? How can I fix this so the model trains fully for the specified number of epochs? Any help with this would be much appreciated, thanks.
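The "Your input ran out of data; interrupting training" warning in the log is the actual reason the run ends early: the generator supplied fewer batches than steps_per_epoch * epochs (1000 * 50 = 50000 in this case), so Keras interrupts training after the first pass. One option is to pass a --steps value that matches what the dataset can actually supply. Below is a minimal sketch of how that value could be computed; the trainval.txt path and the batch size of 1 are assumptions based on the standard Pascal VOC layout and keras-retinanet defaults, not something taken from the original post.

import math
import os

# Hypothetical values -- adjust to your actual dataset layout and batch size.
voc_root = "/content/PlumsVOC"
batch_size = 1  # keras-retinanet trains with batch size 1 by default

# Count the training images listed in the VOC trainval split.
image_set = os.path.join(voc_root, "ImageSets", "Main", "trainval.txt")
with open(image_set) as f:
    num_images = sum(1 for line in f if line.strip())

# Steps per epoch that the generator can actually supply without running dry.
steps_per_epoch = math.ceil(num_images / batch_size)
print(f"{num_images} images -> pass --steps {steps_per_epoch} to train.py")

If the training split is as small as the evaluation split (the log reports "4 of 4" images there), --steps 1000 asks for far more batches than exist, which would match the jump from step 11/1000 straight to 1000/1000 in the output.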


Tags: core, node, gpu, session, tensorflow, be, content, profile
