I am currently trying to learn the concepts of reinforcement learning, so I attempted to implement the SARSA algorithm for the cart-pole example using TensorFlow. I compared my algorithm with one that uses a linear approximation function for the Q-value function, and mine is very similar. Unfortunately, my implementation seems to be wrong or inefficient, because learning success is rather limited. Can anyone tell me whether I am doing something wrong? My implementation code is:
```
import numpy as np
import matplotlib.pylab as plt
import random
import gym
import tensorflow as tf

# define a neural network which returns two action-dependent q-values given a state
neural_net = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=[4]),
    tf.keras.layers.Dense(2)
])

# return the neural network's q-value for a specific action
def q_value(state, action):
    return neural_net(tf.convert_to_tensor([state]))[0, action]

# act either randomly or choose the action which maximizes the q-value
def policy(state, epsilon):
    values = neural_net(tf.convert_to_tensor([state]))
    if np.random.rand() < epsilon:
        return random.choice([0, 1])
    else:
        return np.argmax(values)

# initialize gym environment
env = gym.make('CartPole-v0')

# hyperparameters
discount = 0.99
optimizer = tf.keras.optimizers.Adam()
episodes = 1000
epsilon = 0.30

# collect reward for each episode
rewards = []

for episode in range(episodes):
    # start trajectory for episode
    state = env.reset()
    # record rewards during episode
    sum_returns = 0
    # decrease random action after the first 100 episodes
    if episode == 100:
        epsilon = 0.10
    # SARSA updates along the trajectory
    while True:
        action = policy(state, epsilon)
        next_state, reward, done, _ = env.step(action)
        next_action = policy(next_state, epsilon)
        sum_returns += 1
        if done:
            with tf.GradientTape() as tape:
                tape.watch(neural_net.trainable_variables)
                q_hat = q_value(state, action)
                y = reward
                loss = tf.square(y - q_hat)
            gradients = tape.gradient(loss, neural_net.trainable_variables)
            optimizer.apply_gradients(zip(gradients, neural_net.trainable_variables))
            break
        else:
            with tf.GradientTape() as tape:
                tape.watch(neural_net.trainable_variables)
                q_hat = q_value(state, action)
                y = reward + discount * q_value(next_state, next_action)
                loss = tf.square(y - q_hat)
            gradients = tape.gradient(loss, neural_net.trainable_variables)
            optimizer.apply_gradients(zip(gradients, neural_net.trainable_variables))
            state = next_state
    rewards.append(sum_returns)

# plot learning over time
plt.plot([episode for episode in range(episodes)], rewards)
plt.show()
```

Posted on 2021-06-04 13:16:58
I took a quick look at your code, and it seems the neural network has no way of knowing that the new Q-value estimate y belongs to the chosen action a, because you feed in the same state regardless of which action is picked afterwards.
My suggestion would be to stack the state twice, since you have two actions: when the first action is chosen, you write the current state only into the first half of this intermediate representation and leave the second half of the vector empty; when the second action is chosen you do the same thing the other way round, leaving the first half of the vector empty. A sketch of this idea is given below.
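If it helps, here is a minimal sketch of that encoding, assuming the two CartPole actions and a 4-dimensional state; the names `state_action_features` and `q_net` are only illustrative, and the network then outputs a single Q-value per (state, action) pair instead of one value per action:

```
import numpy as np
import tensorflow as tf

state_dim = 4      # CartPole observation size
num_actions = 2    # push left / push right

def state_action_features(state, action):
    # 8-dimensional vector: the chosen action's slot holds the state,
    # the other slot stays zero, so the input itself encodes the action
    features = np.zeros(state_dim * num_actions, dtype=np.float32)
    features[action * state_dim:(action + 1) * state_dim] = state
    return features

# the network now maps a (state, action) encoding to a single Q-value
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=[state_dim * num_actions]),
    tf.keras.layers.Dense(1)
])

def q_value(state, action):
    x = tf.convert_to_tensor([state_action_features(state, action)])
    return q_net(x)[0, 0]
```

The epsilon-greedy policy would then evaluate `q_value(state, 0)` and `q_value(state, 1)` and pick the action with the larger value.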
Have a look at this lecture I found on Coursera while implementing SARSA: https://www.coursera.org/lecture/prediction-control-function-approximation/episodic-sarsa-with-function-approximation-z9xQJ
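For reference, the update covered there (episodic semi-gradient SARSA, as in Sutton and Barto) is w ← w + α [R + γ q̂(S', A', w) − q̂(S, A, w)] ∇_w q̂(S, A, w), where the bootstrapped target R + γ q̂(S', A', w) is treated as a constant and the gradient is taken only through q̂(S, A, w).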
https://stackoverflow.com/questions/65218524