
Q-Learning for Bandit Problems



Abstract

Multi-armed bandits may be viewed as decompositionally structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a process in state i may be interpreted as a particular component of the maximum-value function associated with the "restart-in-i" process, a simple MDP to which standard methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model; it suggests utilizing process-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and includes an example in which the on-line reinforcement learning approach is applied to a problem of stochastic scheduling, one instance drawn from a wide class of problems that may be formulated as bandit problems.
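To make the restart-in-i construction concrete, below is a minimal Python sketch, not the paper's algorithm as published: one Q-learning agent per restart state i estimates the value of the restart-in-i MDP for a single bandit process (arm), using only sampled transitions. The transition matrix P, reward vector R, step size alpha, and the update scheduling are illustrative assumptions; the model appears only to simulate the transitions the learners observe.

```python
import numpy as np

# Minimal illustrative sketch (assumptions as noted above): learn the Gittins
# indices of a single 3-state Markov reward process by running one Q-learning
# agent per restart state i on its "restart-in-i" MDP.
n_states = 3
gamma = 0.9                       # discount factor
P = np.array([[0.5, 0.5, 0.0],    # hypothetical transition matrix (rows sum to 1)
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.5, 0.2])     # hypothetical per-state rewards

alpha = 0.1                                # step size
Q = np.zeros((n_states, n_states, 2))      # Q[i, s, a]: agent i, state s, a in {continue, restart}

rng = np.random.default_rng(0)
s = 0
for _ in range(200_000):
    r = R[s]                               # reward observed in the current state
    s_next = rng.choice(n_states, p=P[s])  # sampled successor state

    for i in range(n_states):
        # "Continue" action: ordinary Q-learning update from the observed transition.
        target = r + gamma * Q[i, s_next].max()
        Q[i, s, 0] += alpha * (target - Q[i, s, 0])

        # "Restart in i" action: its outcome does not depend on the current state,
        # so a transition out of state i updates agent i's restart value for every state.
        if s == i:
            Q[i, :, 1] += alpha * (target - Q[i, :, 1])

    s = s_next

# Katehakis & Veinott: the Gittins index of state i is (1 - gamma) times the
# optimal value of the restart-in-i process evaluated at state i.
idx = np.arange(n_states)
gittins = (1.0 - gamma) * Q[idx, idx].max(axis=-1)
print("estimated Gittins indices:", gittins)
```

The broadcast update reflects the fact that the restart action's return does not depend on the state from which it is taken, so every observed transition out of state i is a valid sample for agent i's restart value in all states; in the full bandit setting, each arm's process would carry its own family of such agents, and the arm whose current state has the largest estimated index would be played.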
