Proceedings of the 39th IEEE Conference on Decision and Control, 2000

Stochastic optimization of controlled partially observable Markov decision processes



Abstract

We introduce an online algorithm for finding local maxima of the average reward in a partially observable Markov decision process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ (0, 1), which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
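The kind of update the abstract describes (an online, single-sample-path policy-gradient method whose only free parameter is β, discounting an eligibility trace) can be sketched roughly as follows. The two-state toy POMDP, the logistic policy, and all names below are illustrative assumptions for this sketch, not the paper's actual construction or experiments:

```python
import math
import random

def observe(state):
    # Noisy observation of the hidden state: correct with probability 0.8.
    return state if random.random() < 0.8 else 1 - state

def step(state, action):
    # Toy dynamics: action 1 flips the state with high probability,
    # action 0 mostly keeps it. Reward 1 whenever the next state is 1.
    flip = random.random() < (0.9 if action == 1 else 0.1)
    nxt = 1 - state if flip else state
    return nxt, float(nxt == 1)

def action_prob(theta, obs):
    # Probability of action 0 under a logistic policy keyed on the observation.
    return 1.0 / (1.0 + math.exp(-theta[obs]))

def online_policy_gradient(beta=0.9, lr=0.02, steps=50000, seed=0):
    # Sketch of an online beta-trace policy-gradient loop: the agent sees
    # only observations (never the state), follows a single sample path,
    # and ascends the reward-weighted eligibility trace.
    random.seed(seed)
    theta = [0.0, 0.0]   # policy parameters: one logit per observation
    z = [0.0, 0.0]       # eligibility trace, discounted by beta each step
    state = 0
    for _ in range(steps):
        obs = observe(state)
        p0 = action_prob(theta, obs)
        action = 0 if random.random() < p0 else 1
        # Gradient of log pi(action | obs) w.r.t. theta[obs].
        grad = (1.0 - p0) if action == 0 else -p0
        # beta near 1: longer credit assignment (low bias, high variance);
        # beta near 0: short traces (high bias, low variance).
        z = [beta * zi for zi in z]
        z[obs] += grad
        state, reward = step(state, action)
        theta = [th + lr * reward * zi for th, zi in zip(theta, z)]
    return theta
```

In this toy environment the learned logits should come to favor keeping the state when the observation suggests state 1 and flipping when it suggests state 0; the abstract's convergence and mixing-time results concern how large β must be relative to the induced chain's mixing time for such updates to track the true gradient.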


