
Model-building semi-Markov adaptive critics

Abstract

Adaptive critics, or actor-critics, are a class of reinforcement learning (RL) or approximate dynamic programming (ADP) algorithms in which one searches over stochastic policies in order to determine the optimal deterministic policy. Classically, these algorithms have been studied for Markov decision processes (MDPs) in the context of model-free updates, in which transition probabilities are avoided altogether. A model-free version for the semi-MDP (SMDP) under discounted reward, in which the time taken by each transition can be a random variable, was proposed in Gosavi [1]. In this paper, we propose a variant in which the transition probability model is built simultaneously with the value function and the action-probability functions. While our new algorithm does not require the transition probabilities a priori, it generates them along with the estimates of the value function and the action-probability functions required in adaptive critics. Model-building and model-based versions of algorithms have numerous advantages over their model-free counterparts. In particular, they are more stable and may require less training. However, the additional steps of building the model may require increased storage in the computer's memory. In addition to enumerating potential application areas for our algorithm, we analyze the advantages and disadvantages of model building.
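To make the idea concrete, below is a minimal sketch of a model-building adaptive critic for a discounted-reward SMDP. It assumes a tabular setting, softmax (Gibbs) action probabilities derived from actor preferences, a lump-sum reward earned at each transition, and continuous-time discounting of the form exp(-gamma*t) over a transition of duration t. The class name, step sizes, and update rules are illustrative assumptions, not the exact algorithm of the paper; the sketch only shows how transition counts, reward sums, and transition-time sums can be accumulated online into an empirical model that then drives the critic and actor updates.

```python
# A hedged sketch of a model-building SMDP adaptive critic: the transition
# model is estimated from observed transitions and used to form the critic
# target, while a softmax actor is adjusted by the resulting TD error.
import math
import random
from collections import defaultdict

class ModelBuildingSMDPCritic:
    def __init__(self, n_states, n_actions, gamma=0.1, alpha=0.1, beta=0.05):
        self.nS, self.nA = n_states, n_actions
        self.gamma = gamma   # continuous-time discount rate (assumed form)
        self.alpha = alpha   # critic (value-function) step size
        self.beta = beta     # actor (action-preference) step size
        self.V = [0.0] * n_states                                # critic
        self.H = [[0.0] * n_actions for _ in range(n_states)]    # actor
        # Model statistics, built online from observed transitions.
        self.count = defaultdict(int)      # (s, a, s') -> visit count
        self.total = defaultdict(int)      # (s, a)     -> total visits
        self.r_sum = defaultdict(float)    # (s, a, s') -> cumulative reward
        self.t_sum = defaultdict(float)    # (s, a, s') -> cumulative time

    def policy(self, s):
        """Softmax (Gibbs) action probabilities from actor preferences."""
        m = max(self.H[s])
        w = [math.exp(h - m) for h in self.H[s]]
        z = sum(w)
        return [x / z for x in w]

    def act(self, s):
        return random.choices(range(self.nA), weights=self.policy(s))[0]

    def observe(self, s, a, r, t, s2):
        """Fold one transition into the model, then update critic and actor."""
        self.count[(s, a, s2)] += 1
        self.total[(s, a)] += 1
        self.r_sum[(s, a, s2)] += r
        self.t_sum[(s, a, s2)] += t
        # Model-based critic target: expected discounted one-step return
        # under the transition probabilities estimated so far.
        target = 0.0
        for s_next in range(self.nS):
            n = self.count[(s, a, s_next)]
            if n == 0:
                continue
            p = n / self.total[(s, a)]
            r_bar = self.r_sum[(s, a, s_next)] / n
            t_bar = self.t_sum[(s, a, s_next)] / n
            target += p * (r_bar + math.exp(-self.gamma * t_bar) * self.V[s_next])
        delta = target - self.V[s]
        self.V[s] += self.alpha * delta    # critic update
        self.H[s][a] += self.beta * delta  # actor update
```

Note how the model statistics grow with the number of observed (s, a, s') triples; this is the extra memory cost that the abstract attributes to model building, traded against a more stable, expectation-based critic target.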
