ACM Transactions on Modeling and Computer Simulation

Actor-Critic Algorithms with Online Feature Adaptation



Abstract

We develop two new online actor-critic control algorithms with adaptive feature tuning for Markov Decision Processes (MDPs). One of our algorithms is proposed for the long-run average cost objective, while the other works for discounted cost MDPs. Our actor-critic architecture incorporates parameterization both in the policy and the value function. A gradient search in the policy parameters is performed to improve the performance of the actor. The computation of the aforementioned gradient, however, requires an estimate of the value function of the policy corresponding to the current actor parameter. The value function, on the other hand, is approximated using linear function approximation and obtained from the critic. The error in approximation of the value function, however, results in suboptimal policies. In our article, we also update the features by performing a gradient descent on the Grassmannian of features to minimize a mean square Bellman error objective in order to find the best features. The aim is to obtain a good approximation of the value function and thereby ensure convergence of the actor to locally optimal policies. In order to estimate the gradient of the objective in the case of the average cost criterion, we utilize the policy gradient theorem, while in the case of the discounted cost objective, we utilize the simultaneous perturbation stochastic approximation (SPSA) scheme. We prove that our actor-critic algorithms converge to locally optimal policies. Experiments on two different settings show performance improvements resulting from our feature adaptation scheme.
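The abstract describes an actor-critic loop in which a linear critic's features are tuned online by gradient descent on a squared Bellman-error objective. The following Python sketch illustrates that general idea only, under simplifying assumptions: a toy environment, a plain Euclidean gradient step on the per-sample squared TD error in place of the paper's descent on the Grassmannian of feature subspaces, and illustrative names (Phi, w, theta, alpha_phi) that are not taken from the paper.

```python
# Minimal sketch (not the paper's exact algorithm): actor-critic with a linear
# critic whose feature matrix is itself adapted online to reduce Bellman error.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, k = 10, 3, 4        # k = number of value-function features
Phi = rng.normal(size=(n_states, k))     # feature matrix, one row per state
w = np.zeros(k)                          # critic weights: V(s) ~= Phi[s] @ w
theta = np.zeros((n_states, n_actions))  # actor parameters (softmax policy)
gamma = 0.95                             # discount factor
alpha_w, alpha_theta, alpha_phi = 0.05, 0.01, 0.001

def policy(s):
    # Softmax policy over actions for state s.
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def step(s, a):
    # Toy stand-in for an MDP transition: random next state, simple reward.
    s_next = rng.integers(n_states)
    r = 1.0 if a == s % n_actions else 0.0
    return s_next, r

s = rng.integers(n_states)
for t in range(5000):
    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)

    # Temporal-difference error under the current linear critic.
    delta = r + gamma * Phi[s_next] @ w - Phi[s] @ w

    # Critic update: semi-gradient TD(0) on the weights.
    w += alpha_w * delta * Phi[s]

    # Actor update: policy-gradient step using the TD error as the signal.
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log

    # Feature adaptation: gradient descent on 0.5 * delta**2 with respect to
    # the visited feature rows (the paper instead constrains this descent to
    # the Grassmannian of feature subspaces; an unconstrained step is used
    # here purely for brevity).
    Phi[s]      += alpha_phi * delta * w            # -d(0.5*delta^2)/dPhi[s]
    Phi[s_next] -= alpha_phi * delta * gamma * w    # -d(0.5*delta^2)/dPhi[s_next]

    s = s_next
```

The two feature updates follow from differentiating the squared TD error, since delta depends on Phi[s] through -Phi[s] @ w and on Phi[s_next] through gamma * Phi[s_next] @ w; the paper's Grassmannian formulation additionally keeps the feature subspace well conditioned, which this unconstrained sketch does not attempt.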
