
<blockquote>
 Does state representation generally affect how difficult a problem is? 
</blockquote>

Yes. The representation determines how easy or hard it is for a neural network to learn the relationship between the input features and the target policy or value function.

<blockquote>
 Is this a poor state representation? 
</blockquote>

No, it should be fine. Provided you are interested in seeing an RL agent learn to get from Cartesian coordinates to a solution of this problem, you should be able to use the state directly as ML features.

It could be made into better features for a NN in two ways:

<ul>
<li>Scaling to fit in ranges with zero mean and limited max absolute values.</li>
<li>It could be engineered so that features include domain knowledge of the problem to be solved. </li>
</ul>

The first part is optional for you, but I think that the bearing values benefit a little from being scaled and centred. In a more general case, even if the state description was complete, you may still want a step that re-scales it.

The second part is tricky - yes you could find better features, but part of the challenge is to create an agent that learns to do this itself. In this case, you may want to gain experience training agents where it is less easy to engineer "golden" features, and focus on RL methods.

<blockquote>
 Are there rules of thumb for how to design states? 
</blockquote>

Conceptually, the problem of state representation can break down into three parts:

<ul>
<li>Observations. Raw observations are not always direct candidates for a state representation. In a toy problem, often this is ignored and you feed data from a simulation that looks useful as a state. In the real world you can be limited by what is detectable.</li>
<li>State description. Unless you want to explore POMDPs then you usually want the state description to possess the <a href="https://en.wikipedia.org/wiki/Markov_property" rel="nofollow noreferrer">Markov property</a>. This might already mean processing observations into something else - e.g. using history of last 3 observations, keeping running totals or calculating differences.</li>
<li>Feature vector. Once you have decided to use function approximation for calculating action values or a policy function, then your inputs need to conform to how the function approximation works. For most function approximators, this means numerical values. For neural networks, it means scaling inputs to fit relatively small ranges. There is also the question of feature engineering using domain knowledge of the problem.</li>
</ul>
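
As a rough illustration of the "state description" step, here is a minimal sketch of turning raw observations into a state built from the last 3 observations. The class name `HistoryStacker` and its methods are hypothetical, not part of any library, and padding the history with copies of the first observation is just one possible choice.

```python
from collections import deque


class HistoryStacker:
    """Builds a state from the last k observations (hypothetical helper)."""

    def __init__(self, k=3):
        # A bounded deque drops the oldest observation automatically.
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Pad the history with copies of the first observation so the
        # state has a fixed length from the very first step.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(list(first_obs))
        return self.state()

    def push(self, obs):
        self.frames.append(list(obs))
        return self.state()

    def state(self):
        # Flatten the history: oldest observation first, newest last.
        return [value for frame in self.frames for value in frame]
```

Calling `reset` at the start of an episode and `push` after every environment step gives a fixed-length vector that can then go through the feature scaling step.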

There is some overlap between these design steps, and the distinctions are somewhat artificial. When you are designing a test problem like yours, you may find it simple to combine all three steps into a single observation = state = feature. However, in real-world problems, each of the steps may require some consideration.

You should also consider the likely nature/shape of the function that you will be approximating. Action-value methods like Q-learning need to approximate their value functions, whilst policy gradient methods like REINFORCE and PPO need to approximate policies. Sometimes the map between input features and the target function is simple, and if you are lucky some intuition can lead you to figuring that out. This is also a big driver when choosing between DQN and PPO, for example: which seems easier to figure out, the correct action, or the value of a state/action pair?

<blockquote>
 Would it help to reformulate the state, e.g. to [distance_from_agent_to_rabbit, angle_between_agent_and_rabbit]?
</blockquote>

Maybe. The angle feature that seems most helpful is the difference between the wolf's bearing and the direction of the vector from the wolf to the rabbit. The correct action then maps very clearly from that value to steering the wolf towards the rabbit: in most cases a negative value means steer right and a positive value means steer left.
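
A minimal sketch of that angle feature and the hand-coded rule it suggests. It assumes bearings are measured counter-clockwise from the positive x axis (so the wolf moves by `cos(bearing)`, `sin(bearing)`), which is not confirmed by the environment source. The function names and the `tolerance` value are hypothetical.

```python
import math


def relative_angle(wolf_x, wolf_y, wolf_bearing, rabbit_x, rabbit_y):
    """Signed angle from the wolf's heading to the rabbit, wrapped to [-pi, pi)."""
    # Direction of the vector from wolf to rabbit (assumed convention:
    # counter-clockwise from the positive x axis).
    target = math.atan2(rabbit_y - wolf_y, rabbit_x - wolf_x)
    diff = target - wolf_bearing
    # Wrap so that small deviations to either side stay close to zero.
    return (diff + math.pi) % (2.0 * math.pi) - math.pi


def steer(angle, tolerance=0.05):
    """Hand-coded rule: positive angle -> turn left, negative -> turn right."""
    if angle > tolerance:
        return "left"
    if angle < -tolerance:
        return "right"
    return "straight"
```

With this feature the policy is almost a sign function, which is why it is so much easier to learn than the mapping from raw coordinates.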

However, if you do this, you will in some ways have changed the nature of the problem to be solved. You have to ask, are you interested in applying your domain knowledge of the problem to help the agent, or are you interested to see if the agent can figure out an internal representation that discovers this relationship? For a toy problem, you may want to deliberately make something harder to learn.

<blockquote>
 I am having trouble solving this environment. Agents trained in it get scores only slightly better than a random agent even after long training sessions. I have tried Deep Q-learning (with experience replay, target network), REINFORCE (with and without baseline) and PPO.
</blockquote>

As discussed in the comments, the biggest factor here actually turned out to be a bug in your environment code, where the agent could end up with a bearing value out of range. The environment still worked, because you are using trig functions to calculate movement, but the bug widens the range of bearing values even further and makes identical states appear different to the agent, which makes things even harder to learn.
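
One way to guard against this kind of bug is to wrap the bearing every time it changes. This is a sketch only, not the fix that was applied to the actual environment. The function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def wrap_bearing(bearing):
    """Map any angle in radians to the equivalent angle in [0, 2*pi)."""
    # Python's % takes the sign of the divisor, so negative bearings
    # wrap around to the top of the range instead of staying negative.
    return bearing % TWO_PI


def turn(bearing, delta):
    """Apply a turn and keep the stored bearing in range."""
    return wrap_bearing(bearing + delta)
```

Because sine and cosine are periodic, movement is unchanged by the wrap, but the agent now sees a single value for each physical heading.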

<h3>An experiment</h3>

I managed to solve this environment using a simple single-step DQN-based agent, and had time to experiment with some different input features to the NN to demonstrate my points above. In each case, I used exactly the same hyperparameters for RL and NN (except in the last case, where I had to change the size of the NN's input layer).

I counted the number of training episodes required (including ~80 episodes of purely random behaviour to start experience replay) for 100 completed training runs, and tried some different state feature representations. I counted the environment as "solved" when 100 test runs using the agent acting greedily scored an average return of 20 or more. I did not count the test runs towards the number of training episodes.

I got these results:

<ul>
<li>Unaltered state: 629.99 ± 23.66 episodes, but failed in 28 out of 100 runs</li>
<li>Scaled state: 577.78 ± 19.41 episodes, no failures</li>
<li>Engineered state: 153.84 ± 3.26 episodes, no failures</li>
</ul>

Over 100 trials, the difference in the number of episodes between the unscaled and scaled state is not quite significant. However, the high number of failures for unscaled features (I gave up after 1500 episodes of training) is a significant difference.
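
For readers who want to check the first claim, here is a rough calculation using the figures from the results list. It assumes the plus/minus values are standard errors of the mean and uses a normal approximation. Neither assumption is stated in the text, so treat the result as indicative only.

```python
import math

# Means and plus/minus values from the results list. Treating the
# plus/minus values as standard errors is an assumption.
unscaled_mean, unscaled_se = 629.99, 23.66
scaled_mean, scaled_se = 577.78, 19.41

# Standard error of the difference between two independent means.
diff = unscaled_mean - scaled_mean
se_diff = math.sqrt(unscaled_se ** 2 + scaled_se ** 2)
z = diff / se_diff

# Two-sided p-value under a normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under these assumptions z comes out at roughly 1.7, just short of the usual 1.96 threshold for 5% significance, which matches "not quite significant".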

For the scaled state variation, the scaling for the Cartesian coordinates was $f_i = 2(s_i - 0.5)$ and the scaling for the bearings was $f_i = 0.5(s_i - \pi)$.
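
In code, that scaling might look like the following sketch. The function name `scale_state` is hypothetical, and the state layout is taken from the question: [agent_x, agent_y, agent_bearing, rabbit_x, rabbit_y, rabbit_bearing].

```python
import math


def scale_state(state):
    """Scale the raw 6-value state with the two formulas given in the text."""
    agent_x, agent_y, agent_bearing, rabbit_x, rabbit_y, rabbit_bearing = state

    def scale_coord(s):
        # Maps [0, 1] to [-1, 1].
        return 2.0 * (s - 0.5)

    def scale_bearing(s):
        # Maps [0, 2*pi] to [-pi/2, pi/2].
        return 0.5 * (s - math.pi)

    return [
        scale_coord(agent_x), scale_coord(agent_y), scale_bearing(agent_bearing),
        scale_coord(rabbit_x), scale_coord(rabbit_y), scale_bearing(rabbit_bearing),
    ]
```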

The engineered state used [distance between wolf and rabbit, angle between vector to rabbit and wolf's bearing, rabbit's bearing] scaled as above, and the performance was radically better.
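
A sketch of how such an engineered state could be computed from the raw state. The exact scaling used in the experiment is not spelled out, so the scaling of the distance and of the relative angle below is an assumption, as is the convention that bearings are measured counter-clockwise from the positive x axis. The function name is hypothetical.

```python
import math


def engineered_state(state):
    """[distance, relative angle, rabbit bearing], scaled for a NN."""
    agent_x, agent_y, agent_bearing, rabbit_x, rabbit_y, rabbit_bearing = state
    dx, dy = rabbit_x - agent_x, rabbit_y - agent_y

    distance = math.hypot(dx, dy)
    # Signed angle between the wolf's bearing and the direction to the
    # rabbit, wrapped to [-pi, pi).
    angle = (math.atan2(dy, dx) - agent_bearing + math.pi) % (2.0 * math.pi) - math.pi

    return [
        2.0 * (distance - 0.5),            # assumed: same rule as the coordinates
        0.5 * angle,                       # assumed: already centred, only shrunk
        0.5 * (rabbit_bearing - math.pi),  # bearing rule from the text
    ]
```

Note that this reduces the NN input from six values to three, which is why the input layer had to change size.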

<h3>Another consideration</h3>

As an aside, it is worth mentioning the episode timeout and the "done" flag.

You need to carefully consider what it means for the episode to time out. There is a difference between this being part of the challenge (to succeed within a time limit) and being for training convenience (to avoid wasting time learning from overlong or stuck episodes). Sometimes it is a big difference:

<ul>
<li>For training convenience. The "done" flag is an annoyance here. You want to avoid telling the agent that the episode is really over, as it will falsely learn that some states, sometimes, end the episode. These states may then even become desirable to the agent, as they seem to stop the flow of negative reward. You don't want to store that white lie in the experience replay table. A simple work-around is to have your agent stop prematurely, at least 1 step before the environment times out.</li>
<li>Part of the challenge. If the environment can really time out, and not randomly stop at any point, then in order to preserve the <a href="https://en.wikipedia.org/wiki/Markov_property" rel="nofollow noreferrer">Markov property</a>, you must include a representation of the time in the state - it can be time so far or remaining time. Otherwise you have turned the problem into a POMDP, and added complications for calculating value functions.</li>
</ul>
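
Both cases can be sketched in a few lines. The first function is a variation on the work-around above, not the work-around itself: instead of stopping early, it only treats a real catch as terminal when computing the TD target, so a timeout still bootstraps from the next state. The second function appends the remaining time to the features. The function names and the `gamma` default are hypothetical, and `max_steps=260` comes from the question.

```python
def td_target(reward, next_state_value, caught, gamma=0.99):
    """One-step TD target that ignores time-limit cut-offs."""
    # Bootstrapping stops only on a real terminal state (rabbit caught).
    # A timeout is not terminal, so the next state's value still counts.
    if caught:
        return reward
    return reward + gamma * next_state_value


def with_time_feature(features, step, max_steps=260):
    """Append the fraction of time remaining, for a real time limit."""
    remaining = (max_steps - step) / max_steps
    return list(features) + [remaining]
```

Use the first when the timeout is only a training convenience, and the second when the time limit is genuinely part of the task.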

I have created a simple OpenAI Gym environment, which consists of:

<ul>
<li>A continuous 2D world with x and y in range [0.0, 1.0]</li>
<li>A rabbit which slowly moves randomly in the world with a constant speed</li>
<li>A 'wolf' controlled by the agent. The wolf moves at a constant speed</li>
<li>The actions are [turn left by a constant angle, turn right by a constant angle, do nothing (continue straight)]</li>
<li>The state is [agent_x, agent_y, agent_bearing, rabbit_x, rabbit_y, rabbit_bearing]. Bearings are in radians [0.0, 2*pi]. All values are floating point numbers.</li>
<li>The reward is 30 for catching the rabbit (catching means the agent getting sufficiently close to the rabbit) and -0.1 for each timestep in which the rabbit is not caught.</li>
<li>Maximum timesteps 260</li>
</ul>

I am having trouble solving this environment. Agents trained in it get scores only slightly better than a random agent even after long training sessions. I have tried Deep Q-learning (with experience replay, target network), REINFORCE (with and without baseline) and PPO.

Conceptually, the problem is quite simple. The agent just needs to learn to turn towards the rabbit. However, it occurs to me that the state representation might make the problem more difficult, since only one of the six variables is directly under the agent's control, and three of them (the rabbit state) are completely random.

Does state representation generally affect how difficult a problem is? Is this a poor state representation? Are there rules of thumb for how to design states? Would it help to reformulate the state, e.g. to [distance_from_agent_to_rabbit, angle_between_agent_and_rabbit]?

Environment source code <a href="https://pastebin.com/VRQwJ39S" rel="nofollow noreferrer">here</a>.

Reinforcement learning: easily learnable state representation
