The Journal of Artificial Intelligence Research

Sampling Based Approaches for Minimizing Regret in Uncertain Markov Decision Processes (MDPs)



Abstract

Markov Decision Processes (MDPs) are an effective model for representing decision processes in the presence of transition uncertainty and reward tradeoffs. However, because the transition and reward functions of an MDP are difficult to specify exactly, researchers have proposed uncertain MDP models and robustness objectives for solving them. Most approaches for computing robust policies have focused on maximin policies, which maximize the value obtained in the worst case among all realisations of the uncertainty. Given the overly conservative nature of maximin policies, recent work has proposed minimax regret as an ideal alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only, and they are also limited in their scalability. We therefore provide a general model of uncertain MDPs that considers uncertainty over both the transition and reward functions, and that further allows the uncertainty to be dependent across states and decision epochs. We provide a mixed integer linear program (MILP) formulation for minimizing regret given a set of samples of the transition and reward functions of the uncertain MDP. In addition, we introduce two myopic variants of regret, Cumulative Expected Myopic Regret (CEMR) and One Step Regret (OSR), that can be optimized in a scalable manner: we give a dynamic programming algorithm for CEMR and a policy iteration algorithm for OSR. Finally, to demonstrate the effectiveness of our approaches, we provide comparisons on two benchmark problems from the literature. We observe that optimizing the myopic variants of regret, OSR and CEMR, performs better than directly optimizing regret.
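In symbols, writing Ξ for the set of sampled (transition, reward) realisations, the sample-based regret objective described above can be stated as follows (the notation is assumed here for illustration and is not quoted from the paper):

\[
\operatorname{reg}(\pi) \;=\; \max_{\xi \in \Xi} \left[ v^{*}(\xi) - v^{\pi}(\xi) \right],
\qquad
\pi^{\mathrm{reg}} \;=\; \operatorname*{arg\,min}_{\pi} \operatorname{reg}(\pi),
\]

where \(v^{*}(\xi)\) is the optimal expected value attainable under sample \(\xi\) and \(v^{\pi}(\xi)\) is the expected value of policy \(\pi\) under that sample.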
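To make the inner computation concrete, below is a minimal Python sketch that evaluates this sample-based regret for a fixed randomized stationary policy over a finite horizon. The function names (optimal_value, policy_value, max_regret), the fixed initial state s0 = 0, and the finite-horizon setting are all assumptions made for illustration; the outer minimization over policies, which the paper handles with a MILP, is not shown.

import numpy as np

def optimal_value(P, R, horizon, s0=0):
    # Finite-horizon value iteration for one sampled MDP.
    # P: (S, A, S) transition probabilities; R: (S, A) expected rewards.
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(horizon):
        q = R + P @ v              # q[s, a] = R[s, a] + sum_s' P[s, a, s'] * v[s']
        v = q.max(axis=1)          # optimal backup
    return v[s0]

def policy_value(P, R, pi, horizon, s0=0):
    # Evaluate a randomized stationary policy pi: (S, A), rows summing to 1.
    v = np.zeros(P.shape[0])
    for _ in range(horizon):
        q = R + P @ v
        v = (pi * q).sum(axis=1)   # expected backup under pi
    return v[s0]

def max_regret(samples, pi, horizon):
    # reg(pi): the worst gap, over all sampled (P, R) pairs, between the
    # sample's own optimal value and the value pi achieves on that sample.
    return max(optimal_value(P, R, horizon) - policy_value(P, R, pi, horizon)
               for P, R in samples)

Given samples as a list of (P, R) array pairs and a stochastic policy matrix pi, max_regret(samples, pi, horizon) returns reg(π) as defined above; minimizing this quantity over π is what the paper's MILP formulation addresses.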
