Can't use a custom dataset in the official ResNet model: raises tensorflow.python.framework.errors_impl.CancelledError

Posted 2024-10-02 00:37:59


Recently I modified the official cifar10_main.py so that it trains on the Kaggle dogs_cats_redux dataset.

First, I created TFRecord files following the standard pipeline; you can download the TFRecord files here.
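The TFRecord files themselves are not shown in the post, so the following is only a minimal sketch of what a "standard pipeline" writer might look like, assuming JPEG-encoded images and integer labels (0 = cat, 1 = dog). The feature keys `image/encoded` and `image/label`, the helper `write_tfrecord`, and the label convention are all hypothetical. It uses the `tf.io.TFRecordWriter` name from newer TensorFlow releases; on TF 1.8 the same class is `tf.python_io.TFRecordWriter`.

```python
import tensorflow as tf

# Hypothetical feature keys; the keys used in the question's files are not shown.
IMAGE_KEY = "image/encoded"
LABEL_KEY = "image/label"


def _bytes_feature(value):
    """Wrap a single bytes value in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value):
    """Wrap a single integer in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def write_tfrecord(path, samples):
    """Write (jpeg_bytes, label) pairs to one TFRecord file.

    `label` is assumed to be 0 for cat and 1 for dog (hypothetical convention).
    Returns the number of records written.
    """
    count = 0
    with tf.io.TFRecordWriter(path) as writer:
        for jpeg_bytes, label in samples:
            example = tf.train.Example(features=tf.train.Features(feature={
                IMAGE_KEY: _bytes_feature(jpeg_bytes),
                LABEL_KEY: _int64_feature(int(label)),
            }))
            # Each record is one serialized tf.train.Example protocol buffer.
            writer.write(example.SerializeToString())
            count += 1
    return count
```

Storing the still-encoded JPEG bytes keeps the files small; decoding and resizing then happen in the input pipeline.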

Then I wrote some TFRecord parsing functions and a dogs_cats_model class; the rest of the code is the same as in the original resnet repo. You can check my main.py here.
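For comparison, an input pipeline built only from `tf.data` needs no queue runners, because the Estimator creates and drives the iterator itself. The sketch below is hedged: it assumes hypothetical feature keys `image/encoded` and `image/label`, JPEG-encoded images and integer labels, and it is written to produce the shapes and dtypes shown in the log's `IteratorGetNext` node (images `[?, 224, 224, 3]` and one-hot labels `[?, 2]`, both float). `parse_record` and `input_fn(filenames, ...)` are illustrative names, not the official repo's signatures. On TF 1.8, use `tf.parse_single_example`, `tf.FixedLenFeature` and `tf.image.resize_images` in place of the `tf.io.*` and `tf.image.resize` calls.

```python
import tensorflow as tf

# Assumed values, chosen to match the shapes in the IteratorGetNext node.
_HEIGHT, _WIDTH, _NUM_CHANNELS = 224, 224, 3
_NUM_CLASSES = 2
# Hypothetical feature keys; adjust to whatever the TFRecord files really use.
IMAGE_KEY = "image/encoded"
LABEL_KEY = "image/label"


def parse_record(serialized):
    """Parse one serialized tf.train.Example into (image, one_hot_label)."""
    features = tf.io.parse_single_example(serialized, {
        IMAGE_KEY: tf.io.FixedLenFeature([], tf.string),
        LABEL_KEY: tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features[IMAGE_KEY], channels=_NUM_CHANNELS)
    image = tf.image.resize(image, [_HEIGHT, _WIDTH])  # float32 output
    image = tf.image.per_image_standardization(image)  # zero mean, unit variance
    label = tf.one_hot(tf.cast(features[LABEL_KEY], tf.int32), _NUM_CLASSES)
    return image, label


def input_fn(filenames, is_training, batch_size, num_epochs=1):
    """Queue-free pipeline: no tf.train.*_input_producer, no QueueRunner."""
    dataset = tf.data.TFRecordDataset(filenames)
    if is_training:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.map(parse_record)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)
```

An Estimator could consume this as `estimator.train(lambda: input_fn(["train.tfrecord"], True, 32))` (the file name is hypothetical); the point of the design is that no call inside `parse_record` creates a queue.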

But when I run main.py, it raises a CancelledError:

Traceback (most recent call last):
  File "main.py", line 326, in <module>
    main(argv=sys.argv)
  File "main.py", line 321, in main
    shape=[_HEIGHT, _WIDTH, _NUM_CHANNELS])
  File "/home/jto/projects/dogs_cats_tf/official/resnet/resnet_run_loop.py", line 396, in resnet_main
    max_steps=flags.max_train_steps)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
    saving_listeners)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1059, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 567, in run
    run_metadata=run_metadata)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
    run_metadata=run_metadata)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
    raise six.reraise(*original_exc_info)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/six.py", line 686, in reraise
    raise value
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
    return self._sess.run(*args, **kwargs)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
    run_metadata=run_metadata)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 971, in run
    return self._sess.run(*args, **kwargs)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Queue '_2_input_producer' is already closed.
         [[Node: input_producer/input_producer_Close = QueueCloseV2[cancel_pending_enqueues=false](input_producer)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?,2]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
         [[Node: IteratorGetNext/_2401 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_534_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
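One observation from this log: `input_producer/input_producer_Close` is a `QueueCloseV2` op, and `input_producer` is the default name scope used by the queue-based `tf.train.string_input_producer` and its related `tf.train.*_input_producer` helpers. That suggests a queue-based reader may still be present in the graph next to the `tf.data` iterator (`IteratorGetNext`). A hedged way to check is to list every op whose type mentions `Queue`; the helper `find_queue_ops` below is hypothetical and relies only on `tf.Graph.get_operations()`.

```python
import tensorflow as tf


def find_queue_ops(graph):
    """Return sorted names of ops in `graph` whose op type contains 'Queue'.

    Queue-based input helpers (e.g. tf.train.string_input_producer) create
    ops such as FIFOQueueV2 and QueueCloseV2; tf.data pipelines use
    *Dataset and Iterator* ops instead, so they do not match.
    """
    return sorted(op.name for op in graph.get_operations() if "Queue" in op.type)
```

Called as `find_queue_ops(tf.get_default_graph())` inside the model function, an empty list would indicate that no queue-based reader made it into the training graph.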

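I searched Google for solutions. Most answers say this happens because the data queue stopped for some reason, or because the queue runners were never started, and suggest something like: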

import tensorflow as tf

# Start the data queue (requires an existing tf.Session named sess)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess, coord)
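The snippet above is the TF 1.x pattern for a hand-written `tf.Session` loop; written out in full it also stops and joins the threads. The traceback shows training running through `tf.train.MonitoredSession`, which already starts any queue runners registered in the graph's `QUEUE_RUNNERS` collection under its own `Coordinator`, so adding these lines to `resnet_main()` should not be necessary. The self-contained sketch below (hypothetical function `read_with_queue_runners`; `tf.compat.v1` names so it also runs on TF 2.x, drop the `compat.v1.` prefix on TF 1.8) shows the manual pattern, and that a producer created with `num_epochs` closes its queue once those epochs are used up.

```python
import tensorflow as tf


def read_with_queue_runners(filenames, num_epochs=1):
    """Dequeue filenames from a queue-based producer until it is exhausted."""
    graph = tf.Graph()
    with graph.as_default():
        queue = tf.compat.v1.train.string_input_producer(
            filenames, num_epochs=num_epochs, shuffle=False)
        next_name = queue.dequeue()
        # num_epochs is tracked in a local variable that must be initialized.
        init = tf.compat.v1.local_variables_initializer()
        with tf.compat.v1.Session() as sess:
            sess.run(init)
            coord = tf.compat.v1.train.Coordinator()
            threads = tf.compat.v1.train.start_queue_runners(sess=sess, coord=coord)
            seen = []
            try:
                while not coord.should_stop():
                    seen.append(sess.run(next_name).decode())
            except tf.errors.OutOfRangeError:
                # The queue runner closed the queue after num_epochs epochs
                # and the remaining elements have been dequeued.
                pass
            finally:
                coord.request_stop()
                coord.join(threads)
    return seen
```

For example, `read_with_queue_runners(["a", "b"], num_epochs=2)` yields the names in order twice before the closed queue ends the loop.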

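But the official resnet_run_loop.resnet_main() uses an Estimator to train the model, and its source code does not start the queues like that. So how can we solve this problem? Any comments would be greatly appreciated.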

System info: Ubuntu 16.04 LTS, tensorflow-gpu 1.8.0, CUDA 9.0, cuDNN 7.1

The GitHub issue is here.


