Q-learning: reward
I'm having trouble interpreting this pseudocode for the Q-learning algorithm:
For each s, a initialize table entry Q(a, s) = 0
Observe current state s
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s′ ← δ(a, s)
    Update the table entry for Q(a, s) as follows:
        Q(a, s) ← R(s) + γ * max Q(a′, s′)
    s ← s′
Is the reward collected from the successor state s′ or from the current state s?
Posted on 2014-04-02 08:20:57
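The reward should be collected from the successor state, i.e. the state s′ you enter after executing action a. In other words, the R(s) term in the update plays the role of the immediate reward r received after the transition to s′, not a reward attached to the state you started from.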
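As a minimal sketch (not the questioner's or the answerer's code), the pseudocode can be written in Python for a hypothetical four-state deterministic chain in which entering state 3 yields a reward of 1 and every other transition yields 0. The names `delta`, `reward`, `q_learning` and `GOAL`, the discount `GAMMA = 0.9`, random action selection and the fixed episode count are all assumptions made for illustration, and the table is keyed as (s, a) rather than (a, s). What it shows is that r is computed from s′ only after the transition has been observed.

```python
import random

# Hypothetical deterministic chain (an assumption for illustration):
# states 0..3, state 3 is the goal, actions move left or right.
N_STATES = 4
GOAL = 3
ACTIONS = ("left", "right")
GAMMA = 0.9  # assumed discount factor


def delta(s, a):
    """Hypothetical deterministic transition function: returns s' for (s, a)."""
    if a == "right":
        return min(s + 1, GOAL)
    return max(s - 1, 0)


def reward(s_next):
    """Hypothetical reward attached to the successor state s' (1 on entering the goal)."""
    return 1.0 if s_next == GOAL else 0.0


def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    # Initialise every table entry Q(s, a) to 0.
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    # "Do forever" is replaced by a fixed number of episodes (assumption).
    for _ in range(episodes):
        s = 0  # observe the start state
        while s != GOAL:
            a = rng.choice(ACTIONS)  # select an action (purely random exploration)
            s_next = delta(s, a)     # observe the new state s'
            r = reward(s_next)       # immediate reward, determined by s'
            # Deterministic update: Q(s, a) <- r + gamma * max over a' of Q(s', a')
            q[(s, a)] = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            s = s_next
    return q


if __name__ == "__main__":
    q = q_learning()
    for s in range(GOAL):
        print(s, round(q[(s, "right")], 3))
```

Under these assumptions the values for moving right from states 0, 1 and 2 settle at 0.81, 0.9 and 1.0: the reward of 1 is earned on the transition into s′ = 3 and is discounted by γ once for each step further back.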
https://stackoverflow.com/questions/22805323