Discrete Event Dynamic Systems

Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes

Abstract

This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes with finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting while the rest are for finite-action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well in our setting. Numerical experiments on a problem of rate-based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in the continuous-time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance than those of Konda and Borkar (1999).
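To make the two-timescale structure concrete, the following is a minimal sketch (not the paper's actual algorithms) of such a scheme on a small synthetic average-cost MDP: a faster-timescale TD(0)-style update of the differential cost and average-cost estimate, and a slower-timescale SPSA gradient search over softmax policy parameters. The transition tensor `P`, cost matrix `C`, softmax parameterisation, step sizes, and the use of independent rollouts for the SPSA cost measurements are all illustrative assumptions.

```python
import numpy as np

# Sketch of a two-timescale actor-critic with an SPSA policy-gradient estimate
# on a small synthetic average-cost MDP. All model and algorithm parameters
# below are illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)
S, A = 5, 3                                   # number of states and actions
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
C = rng.uniform(0.0, 1.0, size=(S, A))        # single-stage costs

def policy(theta):
    """Randomized stationary policy: softmax over actions in each state."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def simulate_step(s, pi):
    a = rng.choice(A, p=pi[s])
    s_next = rng.choice(S, p=P[s, a])
    return a, C[s, a], s_next

def average_cost_estimate(theta, horizon=1000):
    """Empirical average cost of the policy over one simulated trajectory."""
    pi = policy(theta)
    s, total = 0, 0.0
    for _ in range(horizon):
        _, c, s = simulate_step(s, pi)
        total += c
    return total / horizon

theta = np.zeros((S, A))        # actor parameters (slower timescale)
h = np.zeros(S)                 # critic: differential cost per state (tabular "features")
rho = 0.0                       # running average-cost estimate
s = 0

# The critic (faster timescale) uses a larger step size than the actor (slower timescale).
a_fast, b_slow, delta = 0.05, 0.005, 0.1

for n in range(10000):
    # Faster timescale: TD(0)-style update of the differential cost h and of rho.
    pi = policy(theta)
    a, c, s_next = simulate_step(s, pi)
    td_error = c - rho + h[s_next] - h[s]
    h[s] += a_fast * td_error
    rho += a_fast * (c - rho)
    s = s_next

    # Slower timescale: SPSA gradient estimate. All parameters are perturbed
    # simultaneously by +/- delta * Delta and two cost measurements give the
    # gradient estimate componentwise.
    if n % 200 == 0:
        Delta = rng.choice([-1.0, 1.0], size=theta.shape)
        J_plus = average_cost_estimate(theta + delta * Delta)
        J_minus = average_cost_estimate(theta - delta * Delta)
        grad_est = (J_plus - J_minus) / (2.0 * delta * Delta)
        theta -= b_slow * grad_est      # gradient descent on the average cost

print("estimated average cost:", rho)
```

In the paper's algorithms the faster-timescale differential-cost estimates under the perturbed parameters drive the actor update directly; the sketch above instead uses separate rollouts for the two SPSA cost measurements purely for readability, so it illustrates only the two-timescale structure, not the actual update rules.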