Implementing SARSA with TensorFlow

Stack Overflow user
Asked on 2020-12-09 22:15:38
Answers 1 · Views 199 · Followers 0 · Votes 0

I am currently trying to learn the concepts of reinforcement learning, so I attempted to implement the SARSA algorithm for the cart-pole example using TensorFlow. I compared my algorithm to implementations that use a linear approximation function for the Q-value function and found mine to be very similar. Unfortunately, my implementation seems to be wrong or inefficient, since its learning success is rather limited. Can anybody tell me whether I am doing something wrong? My implementation is:

```python
import numpy as np
import matplotlib.pyplot as plt
import random
import gym
import tensorflow as tf


#define a neural network which returns two action dependent q-values given a state
neural_net = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation = 'relu', input_shape = [4]),
    tf.keras.layers.Dense(2)
])

#return the neural network's q-value for a specific action
def q_value(state, action):
    return neural_net(tf.convert_to_tensor([state]))[0, action]

#act either randomly or choose the action which maximizes the q-value
def policy(state, epsilon):
    values = neural_net(tf.convert_to_tensor([state]))
    if np.random.rand() < epsilon:
        return random.choice([0, 1])
    else:
        return np.argmax(values)

#intialize gym environment
env = gym.make('CartPole-v0')

#hyperparameters
discount = 0.99
optimizer = tf.keras.optimizers.Adam()
episodes = 1000
epsilon = 0.30

#collect reward for each episode
rewards = []
for episode in range(episodes):

    #start trajectory for episode
    state = env.reset()

    #record rewards during episode
    sum_returns = 0

    #decrease random action after the first 100 episodes
    if episode == 100:
        epsilon = 0.10

    #SARSA update (on-policy TD control)
    while True:
        action = policy(state, epsilon)
        next_state, reward, done, _ = env.step(action)
        next_action = policy(next_state, epsilon)
        sum_returns += 1

        if done:
            with tf.GradientTape() as tape:
                tape.watch(neural_net.trainable_variables)
                q_hat = q_value(state, action)
                y = reward
                loss = tf.square(y - q_hat)

            gradients = tape.gradient(loss, neural_net.trainable_variables)
            optimizer.apply_gradients(zip(gradients, neural_net.trainable_variables))
            break
        else:
            with tf.GradientTape() as tape:
                tape.watch(neural_net.trainable_variables)
                q_hat = q_value(state, action)
                y = reward + discount * q_value(next_state, next_action)
                loss = tf.square(y - q_hat)

            gradients = tape.gradient(loss, neural_net.trainable_variables)
            optimizer.apply_gradients(zip(gradients, neural_net.trainable_variables))
            state = next_state

    rewards.append(sum_returns)

#plot learning over time
plt.plot([episode for episode in range(episodes)], rewards)
plt.show()
```

1 Answer

Stack Overflow user

Answered on 2021-06-04 13:16:58

I took a quick look at your code, and it seems the neural network has no way of knowing which action the new Q-value estimate y relates to, because you pass in the same state regardless of which action is subsequently chosen.

My suggestion is to concatenate the state twice, since you have two actions: when you pick the first action, you write the current state only into the first half of this intermediate representation and leave the second half of the vector empty; when you pick the second action, you do the same but leave the first half empty instead. A sketch of this encoding follows the link below.

Please check this link I found on Coursera while implementing SARSA: https://www.coursera.org/lecture/prediction-control-function-approximation/episodic-sarsa-with-function-approximation-z9xQJ
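
A minimal sketch of that suggestion, assuming a CartPole-sized observation (the names `encode` and `q_net` are mine for illustration, not from the post): encode each (state, action) pair as a vector of length 2 × state_dim, write the state into the half selected by the action, and let a single-output network score the encoded pair.

```python
import numpy as np
import tensorflow as tf

state_dim = 4    # CartPole observation size
n_actions = 2

#single-output Q-network over the stacked (state, action) representation
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu',
                          input_shape=[state_dim * n_actions]),
    tf.keras.layers.Dense(1)
])

#write the state into the half of the input vector selected by the action,
#leaving the other half at zero (hypothetical helper)
def encode(state, action):
    x = np.zeros(state_dim * n_actions, dtype=np.float32)
    x[action * state_dim:(action + 1) * state_dim] = state
    return x

#Q-value of one (state, action) pair under this representation
def q_value(state, action):
    return q_net(tf.convert_to_tensor([encode(state, action)]))[0, 0]
```

With this representation, the first-layer weights attached to the unused half of the input receive no gradient (their inputs are zero), so each update depends on which action was actually taken, which is the action dependence the answer is pointing at.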

Votes 0
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/65218524
