Discrete Event Dynamic Systems

Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes

Abstract

This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes with finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting while the rest are for finite-action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well in our setting. Numerical experiments on a problem of rate-based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in the continuous-time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance than those of Konda and Borkar (1999).
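To make the two-timescale structure concrete, the following is a minimal sketch (not the paper's actual algorithms) of such a scheme on a small synthetic average-cost MDP: a faster-timescale TD(0)-style update of the differential cost and average-cost estimate, and a slower-timescale SPSA gradient search over softmax policy parameters. The transition tensor `P`, cost matrix `C`, softmax parameterisation, step sizes, and the use of independent rollouts for the SPSA cost measurements are all illustrative assumptions.

```python
import numpy as np

# Sketch of a two-timescale actor-critic with an SPSA policy-gradient estimate
# on a small synthetic average-cost MDP. All model and algorithm parameters
# below are illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)
S, A = 5, 3                                   # number of states and actions
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
C = rng.uniform(0.0, 1.0, size=(S, A))        # single-stage costs

def policy(theta):
    """Randomized stationary policy: softmax over actions in each state."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def simulate_step(s, pi):
    a = rng.choice(A, p=pi[s])
    s_next = rng.choice(S, p=P[s, a])
    return a, C[s, a], s_next

def average_cost_estimate(theta, horizon=1000):
    """Empirical average cost of the policy over one simulated trajectory."""
    pi = policy(theta)
    s, total = 0, 0.0
    for _ in range(horizon):
        _, c, s = simulate_step(s, pi)
        total += c
    return total / horizon

theta = np.zeros((S, A))        # actor parameters (slower timescale)
h = np.zeros(S)                 # critic: differential cost per state (tabular "features")
rho = 0.0                       # running average-cost estimate
s = 0

# The critic (faster timescale) uses a larger step size than the actor (slower timescale).
a_fast, b_slow, delta = 0.05, 0.005, 0.1

for n in range(10000):
    # Faster timescale: TD(0)-style update of the differential cost h and of rho.
    pi = policy(theta)
    a, c, s_next = simulate_step(s, pi)
    td_error = c - rho + h[s_next] - h[s]
    h[s] += a_fast * td_error
    rho += a_fast * (c - rho)
    s = s_next

    # Slower timescale: SPSA gradient estimate. All parameters are perturbed
    # simultaneously by +/- delta * Delta and two cost measurements give the
    # gradient estimate componentwise.
    if n % 200 == 0:
        Delta = rng.choice([-1.0, 1.0], size=theta.shape)
        J_plus = average_cost_estimate(theta + delta * Delta)
        J_minus = average_cost_estimate(theta - delta * Delta)
        grad_est = (J_plus - J_minus) / (2.0 * delta * Delta)
        theta -= b_slow * grad_est      # gradient descent on the average cost

print("estimated average cost:", rho)
```

In the paper's algorithms the faster-timescale differential-cost estimates under the perturbed parameters drive the actor update directly; the sketch above instead uses separate rollouts for the two SPSA cost measurements purely for readability, so it illustrates only the two-timescale structure, not the actual update rules.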