I am applying reinforcement learning to a time-series forecasting problem. So far I have implemented a dueling DDQN algorithm with an LSTM, and it seems to give good results, although convergence is sometimes slow depending on the exact problem. I then implemented C51 distributional reinforcement learning to compare performance (I was hoping it would give better results).
I slightly modified Google's Dopamine code to integrate it into my own code (the network and training parts). I also use double Q-learning to select the next state's action (the original code does not). The problem, however, is that it runs really, really slowly. For comparison, my previous dueling DDQN used to take 3.5 h to train for 50,000 episodes, whereas the C51 algorithm has now been running for almost 10 hours and has only reached 3,000 episodes.
I would like to know whether something is wrong in my adaptation of the code, or whether the C51 algorithm is really that slow. I am using an NVIDIA GeForce RTX 2080 Ti.
Here is the network part:
# network part
self.weights_initializer = tf.contrib.slim.variance_scaling_initializer(
    factor=1.0 / np.sqrt(3.0), mode='FAN_IN', uniform=True)
self.net = tf.contrib.slim.fully_connected(
    self.rnn,  # output of an LSTM
    num_actions * num_atoms,
    activation_fn=None,
    weights_initializer=self.weights_initializer)
self.logits = tf.reshape(self.net, [-1, num_actions, num_atoms])
self.probabilities = tf.contrib.layers.softmax(self.logits)
self.q_values = tf.reduce_sum(self._support * self.probabilities, axis=2)
self.predict = tf.argmax(self.q_values, 1)
self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
self.target_distribution = tf.placeholder(shape=[None, num_atoms],
                                          dtype=tf.float32)
# size of indices: batch_size x 1.
self.indices = tf.range(tf.shape(self.logits)[0])[:, None]
# size of reshaped_actions: batch_size x 2.
self.reshaped_actions = tf.concat([self.indices, self.actions[:, None]], 1)
# For each element of the batch, fetch the logits for its selected action.
self.chosen_action_logits = tf.gather_nd(self.logits, self.reshaped_actions)
self.td_error = tf.nn.softmax_cross_entropy_with_logits(
    labels=self.target_distribution, logits=self.chosen_action_logits)
# Divide by the real length of the episodes instead of averaging, which
# would be incorrect.
self.loss = tf.cast(tf.reduce_sum(self.td_error), tf.float64) / \
    tf.cast(tf.reduce_sum(self.seq_len), tf.float64)
if apply_grad_clipping:
    # Calculate gradients and clip them to handle outliers.
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars),
                                      grad_clipping)
    self.updateModel = optimizer.apply_gradients(zip(grads, tvars),
                                                 name="updateModel")
else:
    self.updateModel = optimizer.minimize(self.loss, name="updateModel")

Here is the training part:
# training part
if i >= pre_train_episodes:
    # Reset the LSTM's hidden state.
    state_train = np.zeros((num_layers, 2, batch_size, h_size))
    # Get a random batch of experiences.
    trainBatch = myBuffer.sample(batch_size)
    # Below we perform the Double-DQN update to the target Q-values.
    num_samples = batch_size * trace_length
    # size of rewards: batch_size x 1
    rewards = trainBatch[:, 2][:, None]
    # size of tiled_support: batch_size x num_atoms
    tiled_support = tf.tile(mainQN._support, [num_samples])
    tiled_support = tf.reshape(tiled_support, [num_samples, num_atoms])
    # size of target_support: batch_size x num_atoms
    is_terminal_multiplier = -(np.array(trainBatch[:, 4]) - 1)
    # Incorporate the terminal state into the discount factor.
    # size of gamma_with_terminal: batch_size x 1
    gamma_with_terminal = gamma * is_terminal_multiplier
    gamma_with_terminal = gamma_with_terminal[:, None]
    target_support = rewards + gamma_with_terminal * tiled_support
    # Double Q-learning: the online network picks the next action...
    next_qt_argmax = sess.run([mainQN.predict], feed_dict={
        mainQN.scalarInput: np.vstack(trainBatch[:, 3]),
        mainQN.trainLength: trace_length,
        mainQN.state_in: state_train,
        mainQN.batch_size: batch_size})
    next_qt_argmax = np.reshape(next_qt_argmax, [-1, 1])
    # ...and the target network evaluates its distribution.
    probabilities = sess.run(targetQN.probabilities, feed_dict={
        targetQN.scalarInput: np.vstack(trainBatch[:, 3]),
        targetQN.trainLength: trace_length,
        targetQN.state_in: state_train,
        targetQN.batch_size: batch_size})
    batch_indices = np.arange(num_samples)[:, None]
    batch_indexed_next_qt_argmax = np.concatenate(
        [batch_indices, next_qt_argmax], axis=1)
    # size of next_probabilities: batch_size x num_atoms
    next_probabilities = tf.gather_nd(probabilities,
                                      batch_indexed_next_qt_argmax)
    target_distribution = project_distribution(target_support,
                                               next_probabilities,
                                               mainQN._support)
    target_distribution = target_distribution.eval()
    loss, _, _ = sess.run(
        [mainQN.loss, mainQN.check_ops, mainQN.updateModel],
        feed_dict={mainQN.scalarInput: np.vstack(trainBatch[:, 0]),
                   mainQN.target_distribution: target_distribution,
                   mainQN.actions: trainBatch[:, 1],
                   mainQN.trainLength: trace_length,
                   mainQN.state_in: state_train,
                   mainQN.batch_size: batch_size})
    # Perform a soft/hard update of the target network at the chosen frequency.
    if i % update_target_freq == 0 or update_target_freq == 1 or softUpdate:
        updateTarget(targetOps, sess)

Helper function:
# Function used above to project the distribution onto the provided support.
def project_distribution(supports, weights, target_support,
                         validate_args=False):
    """Projects a batch of (support, weights) onto target_support.

    Based on equation (7) in (Bellemare et al., 2017):
    https://arxiv.org/abs/1707.06887
    In the rest of the comments we will refer to this equation simply as Eq7.

    This code is not easy to digest, so we will use a running example to
    clarify what is going on, with the following sample inputs:
      * supports =       [[0, 2, 4, 6, 8],
                          [1, 3, 4, 5, 6]]
      * weights =        [[0.1, 0.6, 0.1, 0.1, 0.1],
                          [0.1, 0.2, 0.5, 0.1, 0.1]]
      * target_support = [4, 5, 6, 7, 8]
    In the code below, comments preceded with 'Ex:' will be referencing the
    above values.

    Args:
      supports: Tensor of shape (batch_size, num_dims) defining supports for
        the distribution.
      weights: Tensor of shape (batch_size, num_dims) defining weights on the
        original support points. Although for the CategoricalDQN agent these
        weights are probabilities, it is not required that they are.
      target_support: Tensor of shape (num_dims) defining support of the
        projected distribution. The values must be monotonically increasing.
        Vmin and Vmax will be inferred from the first and last elements of
        this tensor, respectively. The values in this tensor must be equally
        spaced.
      validate_args: Whether we will verify the contents of the
        target_support parameter.

    Returns:
      A Tensor of shape (batch_size, num_dims) with the projection of a batch
      of (support, weights) onto target_support.

    Raises:
      ValueError: If target_support has no dimensions, or if shapes of
        supports, weights, and target_support are incompatible.
    """
    target_support_deltas = target_support[1:] - target_support[:-1]
    # delta_z = `\Delta z` in Eq7.
    delta_z = target_support_deltas[0]
    validate_deps = []
    supports.shape.assert_is_compatible_with(weights.shape)
    supports[0].shape.assert_is_compatible_with(target_support.shape)
    target_support.shape.assert_has_rank(1)
    if validate_args:
        # Assert that supports and weights have the same shapes.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(tf.shape(supports), tf.shape(weights))),
                [supports, weights]))
        # Assert that elements of supports and target_support have the same
        # shape.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(
                    tf.equal(tf.shape(supports)[1], tf.shape(target_support))),
                [supports, target_support]))
        # Assert that target_support has a single dimension.
        validate_deps.append(
            tf.Assert(
                tf.equal(tf.size(tf.shape(target_support)), 1),
                [target_support]))
        # Assert that the target_support is monotonically increasing.
        validate_deps.append(
            tf.Assert(tf.reduce_all(target_support_deltas > 0),
                      [target_support]))
        # Assert that the values in target_support are equally spaced.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(target_support_deltas, delta_z)),
                [target_support]))
    with tf.control_dependencies(validate_deps):
        # Ex: `v_min, v_max = 4, 8`.
        v_min, v_max = target_support[0], target_support[-1]
        # Ex: `batch_size = 2`.
        batch_size = tf.shape(supports)[0]
        # `N` in Eq7.
        # Ex: `num_dims = 5`.
        num_dims = tf.shape(target_support)[0]
        # clipped_support = `[\hat{T}_{z_j}]^{V_max}_{V_min}` in Eq7.
        # Ex: `clipped_support = [[[ 4.  4.  4.  6.  8.]]
        #                         [[ 4.  4.  4.  5.  6.]]]`.
        clipped_support = tf.clip_by_value(supports, v_min, v_max)[:, None, :]
        # Ex: `tiled_support = [[[[ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]]
        #                        [[ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]]]]`.
        tiled_support = tf.tile([clipped_support], [1, 1, num_dims, 1])
        # Ex: `reshaped_target_support = [[[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]
        #                                 [[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]]`.
        reshaped_target_support = tf.tile(target_support[:, None],
                                          [batch_size, 1])
        reshaped_target_support = tf.reshape(reshaped_target_support,
                                             [batch_size, num_dims, 1])
        # numerator = `|clipped_support - z_i|` in Eq7.
        # Ex: `numerator = [[[[ 0.  0.  0.  2.  4.]
        #                     [ 1.  1.  1.  1.  3.]
        #                     [ 2.  2.  2.  0.  2.]
        #                     [ 3.  3.  3.  1.  1.]
        #                     [ 4.  4.  4.  2.  0.]]
        #                    [[ 0.  0.  0.  1.  2.]
        #                     [ 1.  1.  1.  0.  1.]
        #                     [ 2.  2.  2.  1.  0.]
        #                     [ 3.  3.  3.  2.  1.]
        #                     [ 4.  4.  4.  3.  2.]]]]`.
        numerator = tf.abs(tiled_support - reshaped_target_support)
        quotient = 1 - (numerator / delta_z)
        # clipped_quotient = `[1 - numerator / (\Delta z)]_0^1` in Eq7.
        # Ex: `clipped_quotient = [[[[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  1.]]
        #                           [[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  1.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]]]]`.
        clipped_quotient = tf.clip_by_value(quotient, 0, 1)
        # Ex: `weights = [[ 0.1  0.6  0.1  0.1  0.1]
        #                 [ 0.1  0.2  0.5  0.1  0.1]]`.
        weights = weights[:, None, :]
        # inner_prod = `\sum_{j=0}^{N-1} clipped_quotient * p_j(x', \pi(x'))`
        # in Eq7.
        # Ex: `inner_prod = [[[[ 0.1  0.6  0.1  0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0.1]]
        #                     [[ 0.1  0.2  0.5  0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0.1]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]]]]`.
        inner_prod = clipped_quotient * weights
        # Ex: `projection = [[ 0.8  0.0  0.1  0.0  0.1]
        #                    [ 0.8  0.1  0.1  0.0  0.0]]`.
        projection = tf.reduce_sum(inner_prod, 3)
        projection = tf.reshape(projection, [batch_size, num_dims])
        return projection
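As a quick sanity check of the docstring's running example, here is a minimal NumPy re-implementation of the same Eq7 projection (my own sketch, not Dopamine code); it reproduces the expected projection values:

import numpy as np

def project_distribution_np(supports, weights, target_support):
    # Batched NumPy version of Eq7.
    v_min, v_max = target_support[0], target_support[-1]
    delta_z = target_support[1] - target_support[0]
    # (batch_size, 1, num_dims) against (1, num_dims, 1) broadcasts to
    # (batch_size, num_dims, num_dims).
    clipped = np.clip(supports, v_min, v_max)[:, None, :]
    quotient = 1.0 - np.abs(clipped - target_support[None, :, None]) / delta_z
    return (np.clip(quotient, 0.0, 1.0) * weights[:, None, :]).sum(axis=2)

supports = np.array([[0., 2., 4., 6., 8.],
                     [1., 3., 4., 5., 6.]])
weights = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                    [0.1, 0.2, 0.5, 0.1, 0.1]])
target_support = np.array([4., 5., 6., 7., 8.])
print(project_distribution_np(supports, weights, target_support))
# ~ [[0.8 0.  0.1 0.  0.1]
#    [0.8 0.1 0.1 0.  0. ]]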
Thanks in advance!

Answer (posted 2021-07-07 18:37:37):
If there were any problem with your GPU, TensorFlow would print a warning the first time you ran the script.
In general, the C51 DQN algorithm is slower than plain DQN: computing a full distribution over action returns takes longer than computing a single expected value per action.
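To get a feel for the extra work, compare the head sizes (a rough illustration with made-up dimensions; 51 atoms is the default from the C51 paper, your num_atoms may differ):

import numpy as np

batch_size, num_actions, num_atoms = 32, 4, 51

# DQN head: one expected value per action.
dqn_head = np.zeros((batch_size, num_actions))              # 128 outputs
# C51 head: a categorical distribution per action, plus a softmax and an
# O(num_atoms^2) support projection per sample at every training step.
c51_head = np.zeros((batch_size, num_actions, num_atoms))   # 6528 outputs
print(c51_head.size // dqn_head.size)                       # -> 51x more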
Moreover, Google's Dopamine Rainbow/C51 implementation is faster than your custom implementation because its replay memory buffer is wired directly into the TF graph. This means TensorFlow does not waste time shuttling every sampled batch between the Python runtime and the device; instead, all of this is done directly inside the graph.
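A minimal sketch of that idea in TF1 style, using tf.data instead of feed_dict (this is not Dopamine's actual replay API; sample_batches and the shapes are hypothetical stand-ins for your own buffer):

import numpy as np
import tensorflow as tf

def sample_batches():
    # Hypothetical generator wrapping your replay buffer,
    # e.g. yielding myBuffer.sample(batch_size) forever.
    while True:
        yield (np.zeros((32, 84), np.float32),   # states
               np.zeros((32,), np.int32),        # actions
               np.zeros((32,), np.float32))      # rewards

dataset = tf.data.Dataset.from_generator(
    sample_batches,
    output_types=(tf.float32, tf.int32, tf.float32),
    output_shapes=((32, 84), (32,), (32,)))
dataset = dataset.prefetch(4)  # overlap batch sampling with training
states, actions, rewards = dataset.make_one_shot_iterator().get_next()
# Build the loss and update ops directly on these tensors, then call
# sess.run(update_op) with no feed_dict at all.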
If you want to make your program faster, there are a few things you can do:
Add all computations to the graph with @tf.function (for example, the forward pass on the states); a sketch follows below. See https://stackoverflow.com/questions/57487087
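For example, in TF2 eager style (a sketch only, since @tf.function assumes TF2; model stands in for your LSTM network, and the -10/10 support bounds are placeholders for your own Vmin/Vmax):

import tensorflow as tf

num_actions, num_atoms = 4, 51
support = tf.linspace(-10.0, 10.0, num_atoms)  # C51 value support

@tf.function  # traced once into a graph, then reused on every call
def greedy_action(model, states):
    logits = tf.reshape(model(states), [-1, num_actions, num_atoms])
    probabilities = tf.nn.softmax(logits)
    q_values = tf.reduce_sum(support * probabilities, axis=2)
    return tf.argmax(q_values, axis=1)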