Policy Gradients for CVaR-Constrained MDPs

Abstract

We study a risk-constrained version of the stochastic shortest path (SSP) problem, where the risk measure is Conditional Value-at-Risk (CVaR). We propose two algorithms that obtain a locally risk-optimal policy using four tools: stochastic approximation, mini-batches, policy gradients, and importance sampling. Both algorithms incorporate a CVaR estimation procedure along the lines of prior work that in turn builds on Rockafellar-Uryasev's representation for CVaR. Both also use the likelihood ratio principle to estimate two gradients: the gradient of the sum of one cost function (the objective of the SSP) and the gradient of the CVaR of the sum of another cost function (the constraint of the SSP). The algorithms differ in how they approximate the CVaR estimates and the necessary gradients: the first uses stochastic approximation, while the second employs mini-batches in the spirit of Monte Carlo methods. We establish asymptotic convergence of both algorithms. Further, since estimating CVaR is related to rare-event simulation, we incorporate an importance-sampling-based variance reduction scheme into the proposed algorithms.
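The Rockafellar-Uryasev representation mentioned in the abstract writes CVaR at level alpha as the minimum over xi of xi + E[(X - xi)^+] / (1 - alpha), with the minimizer being the Value-at-Risk. The following is a minimal sketch of how the two estimation styles named in the abstract (stochastic approximation and mini-batches) could be applied to that representation for a plain stream of cost samples. It is not the paper's algorithm: the function names, the step-size schedule, and the uniform test distribution are assumptions made for illustration, and the policy-gradient and importance-sampling parts are left out.

```python
import random


def ru_value(xi, x, alpha):
    """Rockafellar-Uryasev integrand: xi + (x - xi)^+ / (1 - alpha)."""
    return xi + max(x - xi, 0.0) / (1.0 - alpha)


def cvar_stochastic_approximation(samples, alpha, xi0=0.0):
    """Incremental (VaR, CVaR) estimates, updated one sample at a time."""
    xi, psi = xi0, 0.0
    for n, x in enumerate(samples, start=1):
        # Running average of the integrand evaluated at the current VaR estimate.
        psi += (ru_value(xi, x, alpha) - psi) / n
        # Subgradient of the integrand with respect to xi.
        grad = 1.0 - (1.0 if x >= xi else 0.0) / (1.0 - alpha)
        # Assumed step-size schedule (hypothetical choice, not from the paper).
        xi -= (0.1 / n ** 0.75) * grad
    return xi, psi


def cvar_mini_batch(samples, alpha):
    """Batch (VaR, CVaR) estimates from the empirical alpha-quantile."""
    xs = sorted(samples)
    k = min(int(alpha * len(xs)), len(xs) - 1)
    xi = xs[k]  # empirical quantile used as the VaR estimate
    psi = sum(ru_value(xi, x, alpha) for x in xs) / len(xs)
    return xi, psi


if __name__ == "__main__":
    rng = random.Random(0)
    costs = [rng.random() for _ in range(100000)]  # Uniform(0, 1) test costs
    print(cvar_stochastic_approximation(costs, alpha=0.9))
    print(cvar_mini_batch(costs, alpha=0.9))
```

For Uniform(0, 1) costs at alpha = 0.9, the exact VaR is 0.9 and the exact CVaR is 0.95, so both estimators should land near those values. The incremental version needs only constant memory per update, while the batch version stores and sorts the whole sample.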

Bibliographic details

Source: International Conference on Algorithmic Learning Theory, 2014, pp. 155-169 (15 pages)
Author: L.A. Prashanth
Original format: PDF
