I am applying reinforcement learning to a time-series forecasting problem. So far I have implemented a dueling DDQN algorithm with an LSTM, and it seems to give good results, although convergence is sometimes slow depending on the exact problem. I then implemented C51 distributional reinforcement learning to compare performance (I was hoping it would give better results).
I slightly modified Google's Dopamine code to integrate it into my own code (the network and training parts). I also use double Q-learning to select the next state's action (the original code does not). The problem, however, is that it runs really, really slowly. For comparison, my previous dueling DDQN used to take 3.5 h to train for 50,000 episodes, whereas the C51 algorithm has now been running for almost 10 hours and has only reached 3,000 episodes.
I would like to know whether something is wrong in my adaptation of the code, or whether the C51 algorithm is really that slow. I am using an NVIDIA GeForce RTX 2080 Ti.
Here is the network part:
# network part
self.weights_initializer = tf.contrib.slim.variance_scaling_initializer(
    factor=1.0 / np.sqrt(3.0), mode='FAN_IN', uniform=True)
self.net = tf.contrib.slim.fully_connected(
    self.rnn,  # output of an LSTM
    num_actions * num_atoms,
    activation_fn=None,
    weights_initializer=self.weights_initializer)
self.logits = tf.reshape(self.net, [-1, num_actions, num_atoms])
self.probabilities = tf.contrib.layers.softmax(self.logits)
self.q_values = tf.reduce_sum(self._support * self.probabilities, axis=2)
self.predict = tf.argmax(self.q_values, 1)
self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
self.target_distribution = tf.placeholder(shape=[None, num_atoms],
                                          dtype=tf.float32)
# size of indices: batch_size x 1.
self.indices = tf.range(tf.shape(self.logits)[0])[:, None]
# size of reshaped_actions: batch_size x 2.
self.reshaped_actions = tf.concat([self.indices, self.actions[:, None]], 1)
# For each element of the batch, fetch the logits for its selected action.
self.chosen_action_logits = tf.gather_nd(self.logits, self.reshaped_actions)
self.td_error = tf.nn.softmax_cross_entropy_with_logits(
    labels=self.target_distribution, logits=self.chosen_action_logits)
# Divide by the real length of the episodes instead of averaging, which
# would be incorrect.
self.loss = tf.cast(tf.reduce_sum(self.td_error), tf.float64) / \
    tf.cast(tf.reduce_sum(self.seq_len), tf.float64)
if apply_grad_clipping:
    # Calculate gradients and clip them to handle outliers.
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars),
                                      grad_clipping)
    self.updateModel = optimizer.apply_gradients(zip(grads, tvars),
                                                 name="updateModel")
else:
    self.updateModel = optimizer.minimize(self.loss, name="updateModel")

Here is the training part:
# training part
if i >= pre_train_episodes:
    # Reset the LSTM's hidden state.
    state_train = np.zeros((num_layers, 2, batch_size, h_size))
    # Get a random batch of experiences.
    trainBatch = myBuffer.sample(batch_size)
    # Below we perform the Double-DQN update to the target Q-values.
    num_samples = batch_size * trace_length
    # size of rewards: batch_size x 1
    rewards = trainBatch[:, 2][:, None]
    # size of tiled_support: batch_size x num_atoms
    tiled_support = tf.tile(mainQN._support, [num_samples])
    tiled_support = tf.reshape(tiled_support, [num_samples, num_atoms])
    # size of target_support: batch_size x num_atoms
    is_terminal_multiplier = -(np.array(trainBatch[:, 4]) - 1)
    # Incorporate the terminal state into the discount factor.
    # size of gamma_with_terminal: batch_size x 1
    gamma_with_terminal = gamma * is_terminal_multiplier
    gamma_with_terminal = gamma_with_terminal[:, None]
    target_support = rewards + gamma_with_terminal * tiled_support
    # Double Q-learning: the online network picks the next action...
    next_qt_argmax = sess.run([mainQN.predict], feed_dict={
        mainQN.scalarInput: np.vstack(trainBatch[:, 3]),
        mainQN.trainLength: trace_length,
        mainQN.state_in: state_train,
        mainQN.batch_size: batch_size})
    next_qt_argmax = np.reshape(next_qt_argmax, [-1, 1])
    # ...and the target network evaluates its distribution.
    probabilities = sess.run(targetQN.probabilities, feed_dict={
        targetQN.scalarInput: np.vstack(trainBatch[:, 3]),
        targetQN.trainLength: trace_length,
        targetQN.state_in: state_train,
        targetQN.batch_size: batch_size})
    batch_indices = np.arange(num_samples)[:, None]
    batch_indexed_next_qt_argmax = np.concatenate(
        [batch_indices, next_qt_argmax], axis=1)
    # size of next_probabilities: batch_size x num_atoms
    next_probabilities = tf.gather_nd(probabilities,
                                      batch_indexed_next_qt_argmax)
    target_distribution = project_distribution(target_support,
                                               next_probabilities,
                                               mainQN._support)
    target_distribution = target_distribution.eval()
    loss, _, _ = sess.run(
        [mainQN.loss, mainQN.check_ops, mainQN.updateModel],
        feed_dict={mainQN.scalarInput: np.vstack(trainBatch[:, 0]),
                   mainQN.target_distribution: target_distribution,
                   mainQN.actions: trainBatch[:, 1],
                   mainQN.trainLength: trace_length,
                   mainQN.state_in: state_train,
                   mainQN.batch_size: batch_size})
    # Perform a soft/hard update of the target network at the chosen frequency.
    if i % update_target_freq == 0 or update_target_freq == 1 or softUpdate:
        updateTarget(targetOps, sess)

Helper function:
# Function used above to project the distribution onto the provided support.
def project_distribution(supports, weights, target_support,
                         validate_args=False):
    """Projects a batch of (support, weights) onto target_support.

    Based on equation (7) in (Bellemare et al., 2017):
    https://arxiv.org/abs/1707.06887
    In the rest of the comments we will refer to this equation simply as Eq7.

    This code is not easy to digest, so we will use a running example to
    clarify what is going on, with the following sample inputs:
      * supports =       [[0, 2, 4, 6, 8],
                          [1, 3, 4, 5, 6]]
      * weights =        [[0.1, 0.6, 0.1, 0.1, 0.1],
                          [0.1, 0.2, 0.5, 0.1, 0.1]]
      * target_support = [4, 5, 6, 7, 8]
    In the code below, comments preceded with 'Ex:' will be referencing the
    above values.

    Args:
      supports: Tensor of shape (batch_size, num_dims) defining supports for
        the distribution.
      weights: Tensor of shape (batch_size, num_dims) defining weights on the
        original support points. Although for the CategoricalDQN agent these
        weights are probabilities, it is not required that they are.
      target_support: Tensor of shape (num_dims) defining support of the
        projected distribution. The values must be monotonically increasing.
        Vmin and Vmax will be inferred from the first and last elements of
        this tensor, respectively. The values in this tensor must be equally
        spaced.
      validate_args: Whether we will verify the contents of the
        target_support parameter.

    Returns:
      A Tensor of shape (batch_size, num_dims) with the projection of a batch
      of (support, weights) onto target_support.

    Raises:
      ValueError: If target_support has no dimensions, or if shapes of
        supports, weights, and target_support are incompatible.
    """
    target_support_deltas = target_support[1:] - target_support[:-1]
    # delta_z = `\Delta z` in Eq7.
    delta_z = target_support_deltas[0]
    validate_deps = []
    supports.shape.assert_is_compatible_with(weights.shape)
    supports[0].shape.assert_is_compatible_with(target_support.shape)
    target_support.shape.assert_has_rank(1)
    if validate_args:
        # Assert that supports and weights have the same shapes.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(tf.shape(supports), tf.shape(weights))),
                [supports, weights]))
        # Assert that elements of supports and target_support have the same
        # shape.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(
                    tf.equal(tf.shape(supports)[1], tf.shape(target_support))),
                [supports, target_support]))
        # Assert that target_support has a single dimension.
        validate_deps.append(
            tf.Assert(
                tf.equal(tf.size(tf.shape(target_support)), 1),
                [target_support]))
        # Assert that the target_support is monotonically increasing.
        validate_deps.append(
            tf.Assert(tf.reduce_all(target_support_deltas > 0),
                      [target_support]))
        # Assert that the values in target_support are equally spaced.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(target_support_deltas, delta_z)),
                [target_support]))
    with tf.control_dependencies(validate_deps):
        # Ex: `v_min, v_max = 4, 8`.
        v_min, v_max = target_support[0], target_support[-1]
        # Ex: `batch_size = 2`.
        batch_size = tf.shape(supports)[0]
        # `N` in Eq7.
        # Ex: `num_dims = 5`.
        num_dims = tf.shape(target_support)[0]
        # clipped_support = `[\hat{T}_{z_j}]^{V_max}_{V_min}` in Eq7.
        # Ex: `clipped_support = [[[ 4.  4.  4.  6.  8.]]
        #                         [[ 4.  4.  4.  5.  6.]]]`.
        clipped_support = tf.clip_by_value(supports, v_min, v_max)[:, None, :]
        # Ex: `tiled_support = [[[[ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]]
        #                        [[ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]]]]`.
        tiled_support = tf.tile([clipped_support], [1, 1, num_dims, 1])
        # Ex: `reshaped_target_support = [[[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]
        #                                 [[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]]`.
        reshaped_target_support = tf.tile(target_support[:, None],
                                          [batch_size, 1])
        reshaped_target_support = tf.reshape(reshaped_target_support,
                                             [batch_size, num_dims, 1])
        # numerator = `|clipped_support - z_i|` in Eq7.
        # Ex: `numerator = [[[[ 0.  0.  0.  2.  4.]
        #                     [ 1.  1.  1.  1.  3.]
        #                     [ 2.  2.  2.  0.  2.]
        #                     [ 3.  3.  3.  1.  1.]
        #                     [ 4.  4.  4.  2.  0.]]
        #                    [[ 0.  0.  0.  1.  2.]
        #                     [ 1.  1.  1.  0.  1.]
        #                     [ 2.  2.  2.  1.  0.]
        #                     [ 3.  3.  3.  2.  1.]
        #                     [ 4.  4.  4.  3.  2.]]]]`.
        numerator = tf.abs(tiled_support - reshaped_target_support)
        quotient = 1 - (numerator / delta_z)
        # clipped_quotient = `[1 - numerator / (\Delta z)]_0^1` in Eq7.
        # Ex: `clipped_quotient = [[[[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  1.]]
        #                           [[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  1.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]]]]`.
        clipped_quotient = tf.clip_by_value(quotient, 0, 1)
        # Ex: `weights = [[ 0.1  0.6  0.1  0.1  0.1]
        #                 [ 0.1  0.2  0.5  0.1  0.1]]`.
        weights = weights[:, None, :]
        # inner_prod = `\sum_{j=0}^{N-1} clipped_quotient * p_j(x', \pi(x'))`
        # in Eq7.
        # Ex: `inner_prod = [[[[ 0.1  0.6  0.1  0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0.1]]
        #                     [[ 0.1  0.2  0.5  0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0.1]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]]]]`.
        inner_prod = clipped_quotient * weights
        # Ex: `projection = [[ 0.8  0.0  0.1  0.0  0.1]
        #                    [ 0.8  0.1  0.1  0.0  0.0]]`.
        projection = tf.reduce_sum(inner_prod, 3)
        projection = tf.reshape(projection, [batch_size, num_dims])
        return projection
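As a quick sanity check of the docstring's running example, here is a minimal NumPy re-implementation of the same Eq7 projection (my own sketch, not Dopamine code); it reproduces the expected projection values:

import numpy as np

def project_distribution_np(supports, weights, target_support):
    # Batched NumPy version of Eq7.
    v_min, v_max = target_support[0], target_support[-1]
    delta_z = target_support[1] - target_support[0]
    # (batch_size, 1, num_dims) against (1, num_dims, 1) broadcasts to
    # (batch_size, num_dims, num_dims).
    clipped = np.clip(supports, v_min, v_max)[:, None, :]
    quotient = 1.0 - np.abs(clipped - target_support[None, :, None]) / delta_z
    return (np.clip(quotient, 0.0, 1.0) * weights[:, None, :]).sum(axis=2)

supports = np.array([[0., 2., 4., 6., 8.],
                     [1., 3., 4., 5., 6.]])
weights = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                    [0.1, 0.2, 0.5, 0.1, 0.1]])
target_support = np.array([4., 5., 6., 7., 8.])
print(project_distribution_np(supports, weights, target_support))
# ~ [[0.8 0.  0.1 0.  0.1]
#    [0.8 0.1 0.1 0.  0. ]]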
Thanks in advance!

Answer (posted 2021-07-07 18:37:37):
If there were any problem with your GPU, TensorFlow would print a warning the first time you ran the script.
In general, the C51 DQN algorithm is slower than plain DQN: computing a full distribution over action returns takes longer than computing a single expected value per action.
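To get a feel for the extra work, compare the head sizes (a rough illustration with made-up dimensions; 51 atoms is the default from the C51 paper, your num_atoms may differ):

import numpy as np

batch_size, num_actions, num_atoms = 32, 4, 51

# DQN head: one expected value per action.
dqn_head = np.zeros((batch_size, num_actions))              # 128 outputs
# C51 head: a categorical distribution per action, plus a softmax and an
# O(num_atoms^2) support projection per sample at every training step.
c51_head = np.zeros((batch_size, num_actions, num_atoms))   # 6528 outputs
print(c51_head.size // dqn_head.size)                       # -> 51x more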
Moreover, Google's Dopamine Rainbow/C51 implementation is faster than your custom implementation because its replay memory buffer is wired directly into the TF graph. This means TensorFlow does not waste time shuttling every sampled batch between the Python runtime and the device; instead, all of this is done directly inside the graph.
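A minimal sketch of that idea in TF1 style, using tf.data instead of feed_dict (this is not Dopamine's actual replay API; sample_batches and the shapes are hypothetical stand-ins for your own buffer):

import numpy as np
import tensorflow as tf

def sample_batches():
    # Hypothetical generator wrapping your replay buffer,
    # e.g. yielding myBuffer.sample(batch_size) forever.
    while True:
        yield (np.zeros((32, 84), np.float32),   # states
               np.zeros((32,), np.int32),        # actions
               np.zeros((32,), np.float32))      # rewards

dataset = tf.data.Dataset.from_generator(
    sample_batches,
    output_types=(tf.float32, tf.int32, tf.float32),
    output_shapes=((32, 84), (32,), (32,)))
dataset = dataset.prefetch(4)  # overlap batch sampling with training
states, actions, rewards = dataset.make_one_shot_iterator().get_next()
# Build the loss and update ops directly on these tensors, then call
# sess.run(update_op) with no feed_dict at all.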
If you want to make your program faster, there are a few things you can do:
Add all computations to the graph with @tf.function (for example, the forward pass on the states); a sketch follows below. See https://stackoverflow.com/questions/57487087
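For example, in TF2 eager style (a sketch only, since @tf.function assumes TF2; model stands in for your LSTM network, and the -10/10 support bounds are placeholders for your own Vmin/Vmax):

import tensorflow as tf

num_actions, num_atoms = 4, 51
support = tf.linspace(-10.0, 10.0, num_atoms)  # C51 value support

@tf.function  # traced once into a graph, then reused on every call
def greedy_action(model, states):
    logits = tf.reshape(model(states), [-1, num_actions, num_atoms])
    probabilities = tf.nn.softmax(logits)
    q_values = tf.reduce_sum(support * probabilities, axis=2)
    return tf.argmax(q_values, axis=1)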