Conference: International Conference on Algorithmic Learning Theory

Policy Gradients for CVaR-Constrained MDPs

Abstract

We study a risk-constrained version of the stochastic shortest path (SSP) problem, where the risk measure considered is Conditional Value-at-Risk (CVaR). We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini-batches, policy gradients and importance sampling. Both algorithms incorporate a CVaR estimation procedure along the lines of [...], which in turn is based on Rockafellar-Uryasev's representation of CVaR, and both use the likelihood ratio principle to estimate the gradient of the expected sum of one cost function (the objective of the SSP) and the gradient of the CVaR of the sum of another cost function (the constraint of the SSP). The algorithms differ in how they approximate the CVaR estimates and the necessary gradients: the first uses stochastic approximation, while the second employs mini-batches in the spirit of Monte Carlo methods. We establish asymptotic convergence of both algorithms. Further, since estimating CVaR is related to rare-event simulation, we incorporate an importance-sampling-based variance reduction scheme into the proposed algorithms.
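
The following Python sketch illustrates the main ingredients named in the abstract under strong simplifying assumptions; it is not the authors' algorithm. Each SSP episode is collapsed into a single random cost X ~ Exp(mean theta) with a scalar policy parameter theta, so exact values are available for checking (VaR_alpha = theta*ln(1/(1-alpha)), CVaR_alpha = VaR_alpha + theta). cvar_sa estimates VaR and CVaR by stochastic approximation on the Rockafellar-Uryasev objective CVaR_alpha(X) = min_xi { xi + E[(X - xi)^+]/(1 - alpha) }, cvar_minibatch computes the same quantities from a batch of samples, and lr_gradients forms likelihood-ratio (score-function) gradient estimates for the objective and for the CVaR constraint. All function names, the toy cost model and the step-size schedule n^(-0.75) are hypothetical choices; importance sampling is omitted.

```python
import math
import random


def cvar_sa(samples, alpha):
    """Stochastic-approximation estimate of (VaR, CVaR) at level alpha.

    Uses the Rockafellar-Uryasev representation
        CVaR_alpha(X) = min_xi { xi + E[(X - xi)^+] / (1 - alpha) },
    whose minimiser is VaR_alpha(X) for a continuous X. xi takes Robbins-Monro
    steps on the sample derivative of that objective; psi is a running average
    of the sampled objective evaluated at the current xi.
    """
    xi, psi = 0.0, 0.0
    for n, x in enumerate(samples, start=1):
        step = n ** -0.75  # assumed step-size schedule, not taken from the paper
        # Update psi with the pre-update xi, so x is a fresh sample for it.
        psi += (xi + max(x - xi, 0.0) / (1.0 - alpha) - psi) / n
        # Sample derivative w.r.t. xi: 1 - 1{x >= xi} / (1 - alpha).
        exceed = 1.0 if x >= xi else 0.0
        xi -= step * (1.0 - exceed / (1.0 - alpha))
    return xi, psi


def cvar_minibatch(batch, alpha):
    """Monte Carlo (mini-batch) estimate of (VaR, CVaR) from i.i.d. samples."""
    xs = sorted(batch)
    # Empirical alpha-quantile: smallest order statistic with empirical CDF >= alpha.
    var = xs[max(math.ceil(alpha * len(xs)) - 1, 0)]
    tail = sum(max(x - var, 0.0) for x in xs) / len(xs)
    return var, var + tail / (1.0 - alpha)


def lr_gradients(obj_costs, con_costs, scores, alpha):
    """Likelihood-ratio estimates of d/dtheta E[G] and d/dtheta CVaR_alpha[C].

    obj_costs[i], con_costs[i]: total objective / constraint cost of episode i.
    scores[i]: d/dtheta log P_theta(episode i), for a scalar parameter theta.
    Assuming a continuous cost distribution, the envelope theorem applied to the
    Rockafellar-Uryasev form gives E[(C - VaR)^+ * score] / (1 - alpha).
    """
    m = len(scores)
    var, _ = cvar_minibatch(con_costs, alpha)
    g_obj = sum(g * s for g, s in zip(obj_costs, scores)) / m
    g_cvar = sum(max(c - var, 0.0) * s
                 for c, s in zip(con_costs, scores)) / (m * (1.0 - alpha))
    return g_obj, g_cvar


def sample_episodes(theta, m, rng):
    """Hypothetical toy stand-in for SSP episodes: one cost X ~ Exp(mean theta).

    For this density, d/dtheta log p_theta(x) = x / theta**2 - 1 / theta.
    """
    xs = [rng.expovariate(1.0 / theta) for _ in range(m)]
    return xs, [x / theta ** 2 - 1.0 / theta for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha, theta = 0.9, 2.0
    xs, scores = sample_episodes(theta, 200_000, rng)
    v = -theta * math.log(1.0 - alpha)  # exact VaR for Exp(mean theta)
    print("SA (VaR, CVaR):        ", cvar_sa(xs, alpha))
    print("mini-batch (VaR, CVaR):", cvar_minibatch(xs, alpha))
    print("exact (VaR, CVaR):     ", (v, v + theta))
    # Here the same cost serves as objective and constraint, purely for checking.
    print("LR gradients:          ", lr_gradients(xs, xs, scores, alpha))
    print("exact gradients:       ", (1.0, 1.0 - math.log(1.0 - alpha)))
```

Run as a script, it prints both CVaR estimates and the gradient estimates next to their exact values. The stochastic-approximation version processes one sample at a time in constant memory, while the mini-batch version sorts a whole batch, which mirrors the distinction the abstract draws between the two algorithms.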
