Computers & Operations Research

An empirical study of policy convergence in Markov decision process value iteration



Abstract

The value iteration algorithm is a well-known technique for generating solutions to discounted Markov decision process (MDP) models. Although simple to implement, the approach is nevertheless limited in situations where many Markov decision processes must be solved, such as in real-time state-based control problems or in simulation/optimization problems, because of the potentially large number of iterations required for the value function to converge to an ε-optimal solution. Experimental results suggest, however, that the sequence of solution policies associated with each iteration of the algorithm converges much more rapidly than does the value function. This behavior has significant implications for designing solution approaches for MDPs, yet it has neither been explicitly characterized in the literature nor generated significant discussion. This paper seeks to generate such discussion by providing comparative empirical convergence results and exploring several predictors that allow estimation of policy convergence speed based on existing MDP parameters.
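The phenomenon described above can be observed directly by instrumenting a standard value iteration loop. The following is a minimal sketch, not the paper's experimental setup: it runs value iteration on a randomly generated discounted MDP (assumed sizes, discount factor, and tolerance) and records the last iteration at which the greedy policy changed alongside the iteration at which the usual ε-optimality stopping rule on the value function is satisfied.

```python
import numpy as np


def value_iteration(P, R, gamma=0.95, eps=1e-4, max_iter=100_000):
    """P: transition probabilities, shape (A, S, S); R: rewards, shape (A, S)."""
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    prev_policy = np.full(num_states, -1)
    last_policy_change = 0

    for k in range(1, max_iter + 1):
        # Bellman backup: Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V)            # shape (A, S)
        V_new = Q.max(axis=0)
        policy = Q.argmax(axis=0)

        if not np.array_equal(policy, prev_policy):
            last_policy_change = k         # greedy policy is still moving

        delta = np.max(np.abs(V_new - V))
        V, prev_policy = V_new, policy

        # Standard stopping rule guaranteeing an eps-optimal value function.
        if delta < eps * (1 - gamma) / (2 * gamma):
            return policy, last_policy_change, k

    return prev_policy, last_policy_change, max_iter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 50, 4                            # assumed problem size for illustration
    P = rng.random((A, S, S))
    P /= P.sum(axis=2, keepdims=True)       # normalize rows to valid distributions
    R = rng.random((A, S))

    policy, k_policy, k_value = value_iteration(P, R)
    print(f"greedy policy last changed at iteration {k_policy}; "
          f"value function met the eps stopping rule at iteration {k_value}")
```

On such randomly generated instances the greedy policy typically stops changing well before the value-function stopping criterion is reached, which is the gap the paper studies empirically.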
