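I implemented a simple trainer class in TensorFlow. I'm running some experiments to check the code's performance, but I'm having trouble understanding what happens under the hood with tf.data.Dataset and tf.function.
Below I describe the tests I ran, followed by some questions about the results.
Configuration: Intel i3 CPU, TensorFlow CPU 2.1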
```python
import tensorflow as tf


class Trainer:
    def __init__(self, model, optimizer, loss):
        self.model = model
        self.loss_function = loss
        self.optimizer = optimizer

    @tf.function
    def train_step(self, inputs, targets):
        with tf.GradientTape() as tape:
            predictions = self.model(inputs)
            loss = self.loss_function(targets, predictions)
        gradients = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        return loss

    # fit using dataset
    @tf.function
    def fit0(self, dataset, epochs):
        for epoch in tf.range(epochs):
            for input_batch, target_batch in dataset:
                self.train_step(input_batch, target_batch)

    # fit using list of tensors
    @tf.function
    def fit1(self, inputs, targets, epochs):
        for epoch in tf.range(epochs):
            for input_batch, target_batch in zip(inputs, targets):
                self.train_step(input_batch, target_batch)
```
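In all of the following tests, train_step is always wrapped in tf.function.
fit0 and fit1 are tested both with and without tf.function.
Here is the code I used to run the tests: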
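Because fit0 and fit1 are themselves tf.functions, the way `epochs` is passed affects tracing. The test script passes it as `tf.constant(..., dtype=tf.int32)`. The stand-alone sketch below shows why that matters. The function `count_steps` and the `traces` list are illustrative and not part of the trainer. A Python int is part of a tf.function's call signature, so each new value triggers a new trace. Scalar tensors with the same dtype and shape reuse a single trace.

```python
import tensorflow as tf

traces = []  # Python side effect: appended to only while tf.function traces


@tf.function
def count_steps(epochs):
    traces.append(1)  # runs at trace time, not on every call
    total = tf.constant(0)
    for _ in tf.range(epochs):  # AutoGraph turns this into a graph while-loop
        total += 1
    return total


# Python ints: each distinct value is baked into its own trace
count_steps(1)
count_steps(100)
python_int_traces = len(traces)

traces.clear()
# int32 scalar tensors: one trace, reused for every value
count_steps(tf.constant(1, dtype=tf.int32))
count_steps(tf.constant(100, dtype=tf.int32))
tensor_traces = len(traces)

print(python_int_traces, tensor_traces)
```

With Python ints, a call such as `fit0(dataset, 100)` after `fit0(dataset, 1)` would retrace, and that tracing time would be counted in the measured runtime.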
```python
import time

import tensorflow as tf

# Assumes the Trainer class defined above.

input_size = 10000
batch_size = 100
q = input_size // batch_size

# create random inputs (x) and outputs (y)
x = tf.random.normal((input_size, 1), dtype=tf.float32)
y = tf.random.normal((input_size, 1), dtype=tf.float32)
splits = tf.fill([q, ], batch_size)

# create a list of tensors representing batches
x_list = tf.split(x, splits)
y_list = tf.split(y, splits)

# create datasets in the different ways
dataset0 = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)
dataset1 = tf.data.Dataset.from_tensor_slices((tf.stack(x_list), tf.stack(y_list)))

# model definition
model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation='tanh', input_shape=(1,)),
    tf.keras.layers.Dense(1, activation='linear')])

# trainer initialization
trainer = Trainer(model=model, optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.MeanSquaredError())

# first run to perform initializations
time0 = time.time()
trainer.fit0(dataset=dataset0, epochs=tf.constant(1, dtype=tf.int32))
time0 = time.time() - time0
time1 = time.time()
trainer.fit0(dataset=dataset1, epochs=tf.constant(1, dtype=tf.int32))
time1 = time.time() - time1
time2 = time.time()
trainer.fit1(inputs=x_list, targets=y_list, epochs=tf.constant(1, dtype=tf.int32))
time2 = time.time() - time2
print("first fit0 with dataset0 took {} seconds".format(time0))
print("first fit0 with dataset1 took {} seconds".format(time1))
print("first fit1 with tensorlist took {} seconds".format(time2))

# measure performances
time0 = time.time()
trainer.fit0(dataset=dataset0, epochs=tf.constant(100, dtype=tf.int32))
time0 = time.time() - time0
time1 = time.time()
trainer.fit0(dataset=dataset1, epochs=tf.constant(100, dtype=tf.int32))
time1 = time.time() - time1
time2 = time.time()
trainer.fit1(inputs=x_list, targets=y_list, epochs=tf.constant(100, dtype=tf.int32))
time2 = time.time() - time2
print("fit0 with dataset0 took {} seconds".format(time0))
print("fit0 with dataset1 took {} seconds".format(time1))
print("fit1 with tensorlist took {} seconds".format(time2))
```
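One structural difference between dataset0 and dataset1 can be checked without timing anything. dataset0 slices single samples and batches them on the fly. Because `batch` is called without `drop_remainder`, its batch dimension is unknown (`None`). dataset1 slices batches that were built in advance, so each element has the static shape (100, 1).

The sketch below rebuilds both pipelines and prints their `element_spec`. It uses `tf.reshape` instead of `tf.split` plus `tf.stack`, which yields the same stacked tensor here. Whether this shape difference explains the timing gap is not established by these tests.

```python
import tensorflow as tf

input_size, batch_size = 10000, 100
q = input_size // batch_size
x = tf.random.normal((input_size, 1))
y = tf.random.normal((input_size, 1))

# dataset0: slices single samples and groups them into batches while iterating
dataset0 = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)

# dataset1: slices a tensor that already holds the batches, shape (q, batch_size, 1)
x_batches = tf.reshape(x, (q, batch_size, 1))
y_batches = tf.reshape(y, (q, batch_size, 1))
dataset1 = tf.data.Dataset.from_tensor_slices((x_batches, y_batches))

# batch dimension: None for dataset0 (no drop_remainder), 100 for dataset1
print(dataset0.element_spec)
print(dataset1.element_spec)
```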
Here are the test results.
The first test uses 100 batches of 100 samples each:
input_size = 10000, batch_size = 100

Without @tf.function:
```
first fit0 with dataset0 took 0.9953532218933105 seconds
first fit0 with dataset1 took 0.07995295524597168 seconds
first fit1 with tensorlist took 0.05196571350097656 seconds
fit0 with dataset0 took 10.46957802772522 seconds
fit0 with dataset1 took 7.822799205780029 seconds
fit1 with tensorlist took 4.650130748748779 seconds
```
With @tf.function:
```
first fit0 with dataset0 took 1.4042332172393799 seconds
first fit0 with dataset1 took 0.46071624755859375 seconds
first fit1 with tensorlist took 7.3524699211120605 seconds
fit0 with dataset0 took 15.077088832855225 seconds
fit0 with dataset1 took 9.136569738388062 seconds
fit1 with tensorlist took 2.1366817951202393 seconds
```
The second test uses 1 batch of 100,000 samples:
input_size = 100000, batch_size = 100000

Without @tf.function:
```
first fit0 with dataset0 took 1.1792669296264648 seconds
first fit0 with dataset1 took 0.027983427047729492 seconds
first fit1 with tensorlist took 0.020987749099731445 seconds
fit0 with dataset0 took 28.71895956993103 seconds
fit0 with dataset1 took 2.730872869491577 seconds
fit1 with tensorlist took 2.194814682006836 seconds
```
With @tf.function:
```
first fit0 with dataset0 took 1.5979444980621338 seconds
first fit0 with dataset1 took 0.4557182788848877 seconds
first fit1 with tensorlist took 0.3708038330078125 seconds
fit0 with dataset0 took 36.43854784965515 seconds
fit0 with dataset1 took 9.819332122802734 seconds
fit1 with tensorlist took 2.1136972904205322 seconds
```
Questions:
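I obtained a good performance improvement; the code and results are shown below.
However, I could only partially answer the questions; in particular, the second one is still open.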
Configuration: Intel i3 CPU, TensorFlow CPU 2.1
Below is the improved code for fit0; the rest of the Trainer class is unchanged:
Here is the code I used to run the tests:
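I don't know exactly what happens under the hood, but the problem can be solved by replacing the following parts:
Use this new dataset, which also folds in the epochs and uses caching and prefetching.
For more information, see here.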
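The new dataset code is missing from this copy of the post. Below is a minimal sketch of a pipeline that matches the description: epochs folded in with `repeat`, plus `cache` and `prefetch`. The helper name `make_dataset` and the order of the transformations are assumptions, not the author's code.

```python
import tensorflow as tf


def make_dataset(x, y, batch_size, epochs):
    """Hypothetical helper: a single dataset that already contains every epoch."""
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    ds = ds.batch(batch_size)
    ds = ds.cache()         # after the first pass, batches are served from memory
    ds = ds.repeat(epochs)  # the epoch loop moves into the input pipeline
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)  # prepare the next batch during the current step
    return ds


x = tf.random.normal((1000, 1))
y = tf.random.normal((1000, 1))
ds = make_dataset(x, y, batch_size=100, epochs=3)
n_batches = sum(1 for _ in ds)  # 10 batches per epoch * 3 epochs = 30
```

Placing `cache` before `repeat` means the batches are built once and then replayed for every epoch. With the epochs inside the dataset, a fit function can presumably iterate over the dataset once instead of wrapping it in `tf.range(epochs)`.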
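I tested fit0 and fit1 with and without tf.function. With these changes, however, I always got better performance with tf.function, so only those results are shown.
The input_size used is 10 times larger. Here are the test results: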
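The first test uses 1000 batches of 100 samples each.
Note that num_unroll=5 improves performance compared with num_unroll=1. Setting num_unroll > 5 did not provide any further improvement.
The second test uses 1 batch of 1,000,000 samples.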
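This copy of the post does not show how num_unroll was implemented, so the following is only one hypothetical way to get that effect. The batched dataset is grouped into chunks of `NUM_UNROLL` batches. Inside the tf.function, a plain Python `range` then calls the step once per batch in the chunk. AutoGraph leaves a Python `range` over an int as an ordinary Python loop, so the step is copied `NUM_UNROLL` times into the graph, while the loop over the dataset stays a single graph loop. `NUM_UNROLL`, `step` and `fit_unrolled` are illustrative names, and `step` stands in for `Trainer.train_step`.

```python
import tensorflow as tf

NUM_UNROLL = 5  # hypothetical constant; the post reports 5 as the best value

total = tf.Variable(0.0)


def step(batch):
    # Stand-in for Trainer.train_step: just accumulates the batch sum.
    total.assign_add(tf.reduce_sum(batch))


@tf.function
def fit_unrolled(dataset):
    # Each element holds NUM_UNROLL consecutive batches: shape (NUM_UNROLL, batch, 1).
    for group in dataset:            # dataset loop: becomes one graph loop
        for i in range(NUM_UNROLL):  # Python range: copied NUM_UNROLL times at trace time
            step(group[i])


x = tf.ones((1000, 1))
ds = (tf.data.Dataset.from_tensor_slices(x)
      .batch(100, drop_remainder=True)           # static batch shape
      .batch(NUM_UNROLL, drop_remainder=True))   # 10 batches -> 2 groups of 5
fit_unrolled(ds)
print(total.numpy())  # 1000 ones summed -> 1000.0
```

The trade-off is that `drop_remainder=True` on the outer `batch` silently discards trailing batches whenever the number of batches is not a multiple of `NUM_UNROLL`.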
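The results above answer the following questions:
Inspecting the graph structure in TensorBoard shows clearly that applying AutoGraph to fit1 creates a very large graph, because the loop is fully unrolled. This gives better performance, but building the graph takes a long time and most likely uses excessive memory, which makes the approach unusable for more complex problems.
However, as shown above, tf.data.Dataset can reach the same performance with only a few unrolled iterations, and the graph stays correspondingly smaller.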
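The unrolling effect can also be observed without TensorBoard by counting the operations in each traced graph. In the sketch below, a loop over a Python list is unrolled at trace time, so its graph grows with the number of batches. A loop over a `tf.data.Dataset` becomes one graph loop whose size does not depend on the number of elements. The names `step`, `loop_over_list`, `loop_over_dataset` and `graph_size` are illustrative. Exact op counts vary between TensorFlow versions, so only the trend is meaningful.

```python
import tensorflow as tf


def step(batch):
    # Stand-in for a training step: a few ops per call.
    return tf.reduce_sum(batch * 2.0)


@tf.function
def loop_over_list(batches):
    total = tf.constant(0.0)
    for batch in batches:  # Python list: stays a Python loop, i.e. unrolled into the graph
        total += step(batch)
    return total


@tf.function
def loop_over_dataset(data):
    total = tf.constant(0.0)
    for batch in tf.data.Dataset.from_tensor_slices(data):  # becomes a single graph loop
        total += step(batch)
    return total


def graph_size(concrete_fn):
    # Number of operations in the traced FuncGraph.
    return len(concrete_fn.graph.get_operations())


sizes = {}
for n in (10, 100):
    batches = [tf.ones((4, 1)) for _ in range(n)]
    data = tf.ones((n, 4, 1))
    sizes[n] = (graph_size(loop_over_list.get_concrete_function(batches)),
                graph_size(loop_over_dataset.get_concrete_function(data)))
print(sizes)  # list graph grows with n; dataset graph does not
```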