Model-building semi-Markov adaptive critics

Abstract

Adaptive or actor critics are a class of reinforcement learning (RL) or approximate dynamic programming (ADP) algorithms in which one searches over stochastic policies in order to determine the optimal deterministic policy. Classically, these algorithms have been studied for Markov decision processes (MDPs) in the context of model-free updates, in which transition probabilities are avoided altogether. A model-free version for the semi-MDP (SMDP) under discounted reward, in which the transition time of each transition can be a random variable, was proposed in Gosavi [1]. In this paper, we propose a variant in which the transition probability model is built simultaneously with the value function and action-probability functions. While our new algorithm does not require the transition probabilities a priori, it generates them along with the estimation of the value function and the action-probability functions required in adaptive critics. Model-building and model-based versions of algorithms have numerous advantages over their model-free counterparts. In particular, they are more stable and may require less training. However, the additional steps of building the model may require increased storage in the computer's memory. In addition to enumerating potential application areas for our algorithm, we will analyze the advantages and disadvantages of model building.
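
The abstract only sketches the algorithm at a high level, so the following is a minimal tabular illustration of the model-building idea under stated assumptions, not the paper's actual method. Transition counts, mean one-step rewards, and mean discount factors exp(-gamma*tau) for the random transition times tau are accumulated online, the critic is updated through these estimates rather than from raw samples, and a Gibbs (softmax) actor maintains the action-probability functions. All names, the step sizes alpha and beta, and the softmax form of the actor are illustrative choices.

```python
import math
import random
from collections import defaultdict


class ModelBuildingSMDPCritic:
    """Hypothetical tabular sketch of a model-building adaptive critic
    for a discounted-reward SMDP (not the algorithm from the paper)."""

    def __init__(self, n_states, n_actions, gamma=0.1):
        self.nS, self.nA = n_states, n_actions
        self.gamma = gamma                                      # continuous-time discount rate
        self.V = [0.0] * n_states                               # critic: value function
        self.H = [[0.0] * n_actions for _ in range(n_states)]   # actor: action preferences
        # model built online from observed transitions
        self.count = defaultdict(int)     # visits to (s, a)
        self.trans = defaultdict(int)     # counts of (s, a, s')
        self.rew = defaultdict(float)     # cumulative immediate reward for (s, a)
        self.disc = defaultdict(float)    # cumulative exp(-gamma * tau) for (s, a)

    def policy(self, s):
        # Gibbs (softmax) distribution: the search over stochastic policies
        m = max(self.H[s])
        w = [math.exp(h - m) for h in self.H[s]]
        z = sum(w)
        return [x / z for x in w]

    def act(self, s):
        return random.choices(range(self.nA), weights=self.policy(s))[0]

    def update(self, s, a, r, tau, s_next, alpha=0.1, beta=0.01):
        # 1) build the model incrementally from the observed transition
        self.count[(s, a)] += 1
        self.trans[(s, a, s_next)] += 1
        self.rew[(s, a)] += r
        self.disc[(s, a)] += math.exp(-self.gamma * tau)        # SMDP discounting over random tau

        # 2) critic update through the estimated model rather than the raw sample
        n = self.count[(s, a)]
        r_bar = self.rew[(s, a)] / n                            # estimated one-step reward
        d_bar = self.disc[(s, a)] / n                           # estimated E[exp(-gamma * tau)]
        ev = sum(c / n * self.V[sp]                             # estimated E[V(s') | s, a]
                 for (si, ai, sp), c in self.trans.items() if si == s and ai == a)
        td = r_bar + d_bar * ev - self.V[s]
        self.V[s] += alpha * td

        # 3) actor update: raise the preference of actions with positive TD error
        self.H[s][a] += beta * td
```

A driving loop would repeatedly call act(s) on a simulator, observe the reward r, transition time tau, and next state, and pass them to update; the trans/count tables are the transition probability model that the abstract says is generated along with the value and action-probability functions.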