Policy Gradients for CVaR-Constrained MDPs

Abstract

We study a risk-constrained version of the stochastic shortest path (SSP) problem, where the risk measure considered is Conditional Value-at-Risk (CVaR). We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini-batches, policy gradients, and importance sampling. Both algorithms incorporate a CVaR estimation procedure along the lines of [3], which in turn is based on the Rockafellar-Uryasev representation of CVaR, and both use the likelihood ratio principle to estimate the gradient of the sum of one cost function (the objective of the SSP) and the gradient of the CVaR of the sum of another cost function (the constraint of the SSP). The algorithms differ in how they approximate the CVaR estimates and the necessary gradients: the first uses stochastic approximation, while the second employs mini-batches in the spirit of Monte Carlo methods. We establish asymptotic convergence of both algorithms. Further, since estimating CVaR is related to rare-event simulation, we incorporate an importance-sampling-based variance reduction scheme into the proposed algorithms.
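To make the CVaR estimation step concrete, the sketch below illustrates the Rockafellar-Uryasev representation, CVaR_alpha(X) = min over xi of { xi + E[(X - xi)^+] / (1 - alpha) }, whose minimiser is VaR_alpha(X). It is not the paper's algorithm: it only contrasts a mini-batch (sample-based) estimate with a stochastic-approximation recursion for a scalar cost, with no policy or gradient involved. The function names, the step-size schedule 1 / (n + offset), and the zero initial values are assumptions chosen for illustration.

```python
import math
import random


def batch_var_cvar(costs, alpha):
    """Mini-batch estimate of (VaR, CVaR) at level alpha from a list of costs."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not costs:
        raise ValueError("costs must be non-empty")
    xs = sorted(costs)
    n = len(xs)
    # Empirical alpha-quantile: smallest sample whose empirical CDF is >= alpha.
    k = max(0, math.ceil(alpha * n) - 1)
    var = xs[k]
    # Rockafellar-Uryasev objective evaluated at xi = VaR.
    mean_excess = sum(max(x - var, 0.0) for x in xs) / n
    return var, var + mean_excess / (1.0 - alpha)


def sa_var_cvar(sample_cost, alpha, num_iters, offset=100.0):
    """Stochastic-approximation estimate of (VaR, CVaR), one sample per step.

    sample_cost is a zero-argument callable returning one random cost.
    The step size 1 / (n + offset) is an illustrative choice, not taken
    from the paper.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    xi, psi = 0.0, 0.0  # running VaR and CVaR estimates (assumed initial values)
    for n in range(1, num_iters + 1):
        step = 1.0 / (n + offset)
        x = sample_cost()
        # CVaR iterate: running average of the Rockafellar-Uryasev integrand.
        target = xi + max(x - xi, 0.0) / (1.0 - alpha)
        psi += step * (target - psi)
        # VaR iterate: stochastic gradient descent on the same objective in xi.
        indicator = 1.0 if x >= xi else 0.0
        xi -= step * (1.0 - indicator / (1.0 - alpha))
    return xi, psi


if __name__ == "__main__":
    rng = random.Random(0)
    print(batch_var_cvar([rng.random() for _ in range(10000)], 0.9))
    print(sa_var_cvar(rng.random, 0.9, 200000))
```

For costs drawn uniformly from [0, 1] and alpha = 0.9, the true values are VaR = 0.9 and CVaR = 0.95, so both estimators should land close to those numbers. Because only about a fraction 1 - alpha of the samples fall in the tail, the estimates are noisy for alpha near 1, which is the rare-event difficulty that motivates the importance sampling scheme mentioned in the abstract.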

Bibliographic details

Source: International Conference on Algorithmic Learning Theory, 2014, 15 pages
Author: L. A. Prashanth
Original format: PDF
Chinese Library Classification: TP301.6-53

Similar literature

1. Policy Evaluation in Continuous MDPs With Efficient Kernelized Gradient Temporal Difference [J]. Koppel Alec, Warnell Garrett, Stump Ethan. IEEE Transactions on Automatic Control. 2021, Issue 4
2. Policy Gradient SMDP for Resource Allocation and Routing in Integrated Services Networks [J]. Ngo Anh VIEN, Nguyen Hoang VIET, SeungGwan LEE. IEICE Transactions on Communications. 2009, Issue 6
3. Policy-Gradients for PSRs and POMDPs [J]. Douglas Aberdeen, Olivier Buffet, Owen Thomas. JMLR: Workshop and Conference Proceedings. 2007, Issue 2007
4. Policy Gradients for CVaR-Constrained MDPs [C]. L. A. Prashanth. International Conference on Algorithmic Learning Theory. 2014
5. Lip Synchronization for ECA Rendering with Self-Adjusted POMDP Policies [D]. Szucs, Tristan. 2019
6. MDPs with Non-Deterministic Policies [O]. Mahdi Milani Fard, Joelle Pineau
7. Policy Gradients for CVaR-Constrained MDPs [O]. Prashanth L A. 2014
