Q-Learning for Bandit Problems

Abstract

Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a process in state i may be interpreted as a particular component of the maximum-value function associated with the "restart-in-i" process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model. It suggests using process-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and it includes an example in which the on-line reinforcement learning approach is applied to a problem of stochastic scheduling, one instance drawn from a wide class of problems that may be formulated as bandit problems.
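
The restart-in-i interpretation mentioned in the abstract can be illustrated with a small model-based sketch. This is not the paper's on-line Q-learning method. It is plain successive approximation on a known model, assuming the normalization in which the index equals (1 - beta) times the restart-in-i value at state i. The function name `gittins_index_restart`, the transition matrix `P`, the reward vector `r`, and the discount `beta` are all hypothetical names chosen for illustration.

```python
def gittins_index_restart(P, r, i, beta=0.9, tol=1e-12, max_iter=100_000):
    """Approximate the Gittins index of state i for one bandit process.

    Assumptions (not taken from the paper's text): P[j][k] is the known
    probability of moving from state j to state k, r[j] is the reward for
    activating the process in state j, and the index is normalized as
    (1 - beta) * V(i), where V is the restart-in-i optimal value function.
    """
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        # Value of the "restart" action: behave as if currently in state i.
        restart = r[i] + beta * sum(P[i][k] * V[k] for k in range(n))
        # In every state j, pick the better of "continue" and "restart".
        new_V = [
            max(r[j] + beta * sum(P[j][k] * V[k] for k in range(n)), restart)
            for j in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    return (1.0 - beta) * V[i]


# Toy two-state process: state 0 pays 0 and moves to state 1,
# which pays 1 forever (absorbing).
P_example = [[0.0, 1.0], [0.0, 1.0]]
r_example = [0.0, 1.0]
print(gittins_index_restart(P_example, r_example, 0, beta=0.9))
print(gittins_index_restart(P_example, r_example, 1, beta=0.9))
```

For this toy chain the index of state 0 should come out close to beta and the index of state 1 close to 1, which matches the ratio-of-discounted-sums definition of the index. A model-free variant along the lines the abstract describes would replace the expectation over `P` with sampled transitions and a Q-learning update for the two actions, continue and restart.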

Bibliographic details

Source: Machine Learning (ML95), 1995, pp. 209-217 (9 pages)
Conference location: Tahoe City, CA (US)
Author: Michael O. Duff
Original format: PDF
Language: English
Chinese Library Classification: Automation technology; computer technology

Similar literature

1. Enhancing Nash Q-learning and Team Q-learning mechanisms by using bottlenecks [J]. Behzad Ghazanfari, Nasser Mozayani. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 2014, No. 6.
2. Backward Q-learning: The combination of Sarsa algorithm and Q-learning [J]. Yin-Hao Wang, Tzuu-Hseng S. Li, Chih-Jui Lin. Engineering Applications of Artificial Intelligence, 2013, No. 9.
3. Top-$k$ Combinatorial Bandits with Full-Bandit Feedback [J]. Idan Rejwan, Yishay Mansour. JMLR: Workshop and Conference Proceedings, 2020, No. 4.
4. Comparing Multi-Armed Bandit Algorithms and Q-learning for Multiagent Action Selection: a Case Study in Route Choice [C]. Thiago B. F. de Oliveira, Ana L. C. Bazzan, Bruno C. da Silva. International Joint Conference on Neural Networks, 2018.
5. Adaptive Preference Learning with Bandit Feedback: Information Filtering, Dueling Bandits and Incentivizing Exploration [D]. Bangrui Chen, 2017.
6. Constrained Deep Q-Learning Gradually Approaching Ordinary Q-Learning [O]. Shota Ohnishi, Eiji Uchibe, Yotaro Yamaguchi, 2019.
