IEEE Conference on Decision and Control

Stochastic optimization of controlled partially observable Markov decision processes



Abstract

We introduce an online algorithm for finding local maxima of the average reward in a partially observable Markov decision process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ (0, 1), which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
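The abstract describes an online, single-sample-path policy-gradient update in which β discounts an eligibility trace, trading bias against variance. The sketch below illustrates that kind of update, not the paper's exact algorithm; the linear-softmax policy, the hypothetical `env_step` callable (returning only an observation and a reward), and all names are assumptions introduced here for illustration.

```python
import numpy as np

def softmax_policy(theta, obs):
    """Action probabilities for a linear-softmax policy; theta has shape (n_actions, obs_dim)."""
    logits = theta @ obs
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def online_gradient_step(theta, z, obs, beta, step_size, env_step, rng):
    """One update from a single sample path; uses only observations and rewards,
    never the underlying state.

    beta in (0, 1) discounts the eligibility trace z: values closer to 1 reduce bias
    (longer credit horizon) at the cost of higher variance, the trade-off noted above.
    """
    probs = softmax_policy(theta, obs)
    a = rng.choice(len(probs), p=probs)

    # gradient of log pi(a | obs; theta) for the linear-softmax policy
    grad_log_pi = -np.outer(probs, obs)
    grad_log_pi[a] += obs

    next_obs, reward = env_step(a)          # hypothetical environment interface
    z = beta * z + grad_log_pi              # eligibility trace
    theta = theta + step_size * reward * z  # stochastic ascent on the average reward
    return theta, z, next_obs
```

A training loop would initialize z to zeros of theta's shape and call online_gradient_step repeatedly along one trajectory, which mirrors the single-sample-path property claimed in the abstract.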
