Traditionally, the Reinforcement Learning (RL) problem is presented as follows: An agent exists in an environment described by some set of possible states S, in which it can perform actions from some set A. Each time it performs an action a_t ∈ A in some state s_t ∈ S, the agent receives a real-valued reward r_t that indicates the immediate value of this state-action transition. This produces a sequence of states, actions, and immediate rewards. The agent's task is to learn a control policy, π:S→A, that maximizes the expected sum of rewards, typically with future rewards discounted exponentially by their delay. Unlike supervised learning, the learner is not told which actions to take, but instead must discover which actions yield the most reward by exploring and exploiting its interaction with the environment. Moreover, actions may affect not only the immediate reward but also the next state and, through it, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two major features of RL.
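The loop described above can be sketched concretely. The following is a minimal, illustrative example, not part of the original text: a tabular Q-learning agent on a hypothetical 5-state chain environment, where the agent earns a reward only on reaching the rightmost state. The environment, reward values, and hyperparameters are all assumptions chosen for illustration.

```python
import random

N_STATES = 5          # S = {0, 1, 2, 3, 4}; state 4 is terminal (the goal)
ACTIONS = [-1, +1]    # A = {move left, move right}
GAMMA = 0.9           # discount factor: future rewards decay exponentially
ALPHA = 0.5           # learning rate
EPSILON = 0.3         # exploration rate (trial-and-error search)

def step(s, a):
    """One state-action transition: returns (next state s_{t+1}, reward r_t)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0  # delayed reward: goal only
    return s_next, r

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: explore with probability EPSILON, else exploit Q
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        # update toward immediate reward plus discounted future value
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy pi: S -> A for the non-terminal states
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the greedy policy moves right in every state, even though only the final transition is rewarded: the discounted backups propagate the delayed reward to earlier states, which is exactly the credit-assignment problem the abstract alludes to.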