Conference: International Conference on Algorithmic Learning Theory

Policy Gradients for CVaR-Constrained MDPs

Abstract

We study a risk-constrained version of the stochastic shortest path (SSP) problem, where the risk measure is Conditional Value-at-Risk (CVaR). We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini-batches, policy gradients, and importance sampling. Both algorithms incorporate a CVaR estimation procedure along the lines of [3], which in turn is based on the Rockafellar-Uryasev representation of CVaR, and both use the likelihood ratio principle to estimate two gradients: the gradient of the sum of one cost function (the objective of the SSP) and the gradient of the CVaR of the sum of another cost function (the constraint of the SSP). The algorithms differ in how they approximate the CVaR estimates and the necessary gradients: the first uses stochastic approximation, while the second employs mini-batches in the spirit of Monte Carlo methods. We establish asymptotic convergence of both algorithms. Further, since estimating CVaR is related to rare-event simulation, we incorporate an importance-sampling-based variance reduction scheme into the proposed algorithms.
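
As a rough illustration of the CVaR estimation step only (not the policy-gradient or importance-sampling parts, and not the authors' actual algorithms), the sketch below contrasts a mini-batch estimate with a stochastic-approximation estimate. Both rest on the Rockafellar-Uryasev representation CVaR_alpha(X) = min over xi of { xi + E[(X - xi)^+] / (1 - alpha) }. The function names, the step-size schedule `step_scale / n**0.75`, and the exponential test costs are assumptions made for this example.

```python
import math
import random


def cvar_batch(costs, alpha):
    """Mini-batch estimate: minimize the empirical Rockafellar-Uryasev objective.

    The empirical alpha-quantile minimizes xi + mean((x - xi)^+) / (1 - alpha),
    so we evaluate the objective there. Assumes 0 < alpha < 1.
    """
    ordered = sorted(costs)
    n = len(ordered)
    k = max(1, math.ceil(alpha * n))      # 1-based index of the empirical quantile
    var = ordered[k - 1]                  # empirical VaR
    excess = sum(max(x - var, 0.0) for x in ordered)
    return var, var + excess / ((1.0 - alpha) * n)


def cvar_stochastic_approx(costs, alpha, xi0=0.0, step_scale=1.0):
    """Stochastic-approximation estimate, processing one cost sample at a time.

    xi follows a stochastic subgradient descent on the Rockafellar-Uryasev
    objective (it tracks VaR); cvar is a running average of the objective
    evaluated at the current xi. The step-size schedule is an assumption.
    """
    xi, cvar = xi0, 0.0
    for n, x in enumerate(costs, start=1):
        # Running average of xi + (x - xi)^+ / (1 - alpha), using the current xi.
        cvar += (xi + max(x - xi, 0.0) / (1.0 - alpha) - cvar) / n
        # Subgradient of the objective with respect to xi at this sample.
        grad = 1.0 - (1.0 if x > xi else 0.0) / (1.0 - alpha)
        xi -= step_scale / n ** 0.75 * grad
    return xi, cvar


if __name__ == "__main__":
    rng = random.Random(0)                # fixed seed for reproducibility
    samples = [rng.expovariate(1.0) for _ in range(100_000)]
    print("batch (VaR, CVaR):", cvar_batch(samples, 0.9))
    print("SA    (VaR, CVaR):", cvar_stochastic_approx(samples, 0.9))
```

For unit-rate exponential costs, VaR at level 0.9 is ln 10 (about 2.30) and CVaR is VaR + 1 (about 3.30), so both estimates should land close to those values. The batch version needs all samples up front, while the stochastic-approximation version updates incrementally, which mirrors the distinction the abstract draws between the two algorithms.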