TPU problem: migrating from TF 1.3 to TF 2.1

Posted 2024-09-29 18:54:32


I am trying to port perfectly working code from TF 1.3 to TF 2.1.

I have simplified the model as much as possible, but it still doesn't work. When I run the code below in Jupyter, the kernel dies as soon as it reaches fit.

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow.keras as k

print('TF v:', tf.__version__, 'Keras v:', k.__version__)

# window_size, cats, X and y are defined earlier in the notebook
# (window_size=1280 and cats=4, judging by the summary below)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://xx.xx.xx.xx:8470')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)
with strategy.scope():
    model = k.Sequential()
    model.add(k.layers.Conv1D(filters=16,  kernel_size=2, activation='relu', input_shape=(window_size, 1)))
    model.add(k.layers.Conv1D(filters=32,  kernel_size=2, activation='relu'))
    model.add(k.layers.Conv1D(filters=64,  kernel_size=2, activation='relu'))
    model.add(k.layers.Conv1D(filters=128, kernel_size=2, activation='relu'))
    model.add(k.layers.MaxPooling1D(pool_size=2))
    model.add(k.layers.Flatten())
    model.add(k.layers.Dense(cats, activation='softmax'))

    # summary
    print(model.metrics_names)
    print(model.summary())

    print('--')
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['categorical_accuracy'])
    print('--')

model.fit(X, y, batch_size=window_size, shuffle=False, epochs=5)

Output:

TF v: 2.1.0 Keras v: 2.2.4-tf
INFO:tensorflow:Initializing the TPU system: xxxxxxxxxx:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
['loss']
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d (Conv1D)              (None, 1279, 16)          48        
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1278, 32)          1056      
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1277, 64)          4160      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 1276, 128)         16512     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 638, 128)          0         
_________________________________________________________________
flatten (Flatten)            (None, 81664)             0         
_________________________________________________________________
dense (Dense)                (None, 4)                 326660    
=================================================================
Total params: 348,436
Trainable params: 348,436
Non-trainable params: 0
_________________________________________________________________
None
--
--

I can see this error in the console. I don't know where the proto-buf comes from, or why this worked in TF 1.3:

E0208 17:03:32.001652096    4567 proto_buffer_writer.h:83]   assertion failed: byte_count_ < total_size_

Any ideas?


1 Answer

#1 · Posted 2024-09-29 18:54:32

This seems to be mainly a ProtoBuf limitation rather than a TensorFlow one: ProtoBuf has a hard limit of 2 GB per call, and TensorFlow can only split data across multiple ProtoBuf messages for tf.data.Dataset objects. You should either make your dataset smaller than 2 GB or convert it into a TensorFlow Dataset. Sources: 1, 2, 3, 4
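A minimal sketch of the second option, reusing X, y, model and window_size from the question; the drop_remainder and prefetch settings are illustrative, not something the answer specifies:

import tensorflow as tf

# Stream the data to the TPU in batches instead of shipping it as one giant
# (2 GB-capped) ProtoBuf payload. X, y, model and window_size are the
# variables from the question above.
dataset = tf.data.Dataset.from_tensor_slices((X, y))
dataset = dataset.batch(window_size, drop_remainder=True)  # TPU requires static batch shapes
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Caveat: from_tensor_slices embeds the arrays as constants in the graph, so
# if the data itself approaches 2 GB, read it from TFRecord files instead.
model.fit(dataset, epochs=5)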
