IEEE Conference on Decision and Control

PAC Bounds for Simulation-Based Optimization of Markov Decision Processes


Abstract

We generalize the PAC learning framework for Markov decision processes developed in [18]. We allow the reward function to depend on both the state and the action, and both the state and action spaces may be countably infinite. We obtain an estimate of the value function of a Markov decision process, which assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds, in terms of the VC or pseudo dimension of the policy class, on the number of runs needed for the empirical average to converge uniformly to the expected reward over a class of policies. We then propose a framework for obtaining an ε-optimal policy from simulation and provide the sample complexity of this approach.
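To make the estimation step concrete, below is a minimal Monte Carlo sketch of estimating a policy's expected discounted reward as an empirical average over independent simulation runs. It is illustrative only: the simulator interface (step, reward), the discount factor, the truncation horizon, and the number of runs are assumed placeholders, not quantities or code from the paper; the paper's contribution is bounding the required number of runs via the VC or pseudo dimension of the policy class.

    import random

    def estimate_value(policy, step, reward, s0, gamma=0.95, horizon=200, n_runs=1000):
        """Empirical average of the (truncated) discounted reward of `policy`
        over `n_runs` independent simulation runs. Hypothetical sketch."""
        total = 0.0
        for _ in range(n_runs):              # independent simulation runs
            s, ret = s0, 0.0
            for t in range(horizon):         # truncate the infinite discounted sum
                a = policy(s)
                ret += (gamma ** t) * reward(s, a)
                s = step(s, a)               # sample the next state from the simulator
            total += ret
        return total / n_runs                # empirical estimate of the value of `policy`

    if __name__ == "__main__":
        # Toy two-state chain (hypothetical): the state flips with probability 1/2,
        # the reward is 1 in state 1 and 0 in state 0, and the policy is a dummy action.
        step = lambda s, a: 1 - s if random.random() < 0.5 else s
        reward = lambda s, a: float(s)
        print(estimate_value(lambda s: 0, step, reward, s0=0))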
