IEEE Conference on Decision and Control

Adversarial Multi-Armed Bandit Approach to Two-Person Zero-Sum Markov Games



Abstract

A sampling-based algorithm for solving stochastic optimization problems, based on Auer et al.'s Exp3 algorithm for "adversarial multi-armed bandit problems," was recently presented by the authors. In particular, the authors recursively extended the Exp3-based algorithm to solve finite-horizon Markov decision processes (MDPs) and analyzed its finite-iteration performance in terms of the expected bias relative to the maximum value of the "recursive sample-average-approximation (SAA)" problem induced by the algorithm's sampling process. They showed that the upper bound on the expected bias approaches zero as the number of samples per sampled state in each stage goes to infinity, so that the algorithm converges in the limit to the optimal value of the original MDP. As a sequel to that work, the idea is further extended to solving two-person zero-sum Markov games (MGs), providing a finite-iteration bound on the equilibrium value of the induced "recursive SAA game" problem and asymptotic convergence to the true equilibrium value. The recursively extended algorithm for MGs can be used to break the curse of dimensionality.
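The building block named in the abstract is Auer et al.'s Exp3 algorithm for adversarial multi-armed bandits. A minimal sketch of standard Exp3 is shown below; the function name, reward interface, and parameter defaults are illustrative and not taken from the paper, which applies Exp3 recursively per sampled state.

```python
import math
import random

def exp3(num_arms, reward_fn, num_rounds, gamma=0.1):
    """Standard Exp3 for adversarial bandits (Auer et al.).

    reward_fn(t, arm) must return a reward in [0, 1].
    Returns the accumulated reward and the final arm weights.
    """
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(num_rounds):
        w_sum = sum(weights)
        # mix the exponential-weights distribution with uniform exploration
        probs = [(1 - gamma) * w / w_sum + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # importance-weighted estimate; only the pulled arm's weight updates
        est = reward / probs[arm]
        weights[arm] *= math.exp(gamma * est / num_arms)
    return total_reward, weights
```

In the recursive scheme the abstract describes, an Exp3 instance of this kind would be run at each sampled state and stage, with the bandit "reward" coming from sampled downstream values rather than a fixed reward function.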


