On-Policy Temporal Difference Algorithm Based on Double-Layer Fuzzy Partitioning

Journal on Communications (《通信学报》)

Abstract

When dealing with continuous-space problems, traditional Q-value iteration algorithms based on lookup tables or function approximation converge slowly and make it difficult to obtain a continuous action policy. To overcome these weaknesses, an on-policy temporal difference (TD) algorithm based on double-layer fuzzy partitioning, named DFP-OPTD, was proposed, and its convergence was analyzed theoretically. The first layer of fuzzy partitioning is applied to the state space, the second layer to the action space, and the Q-value function is computed by combining the two layers. Based on the resulting Q-value function, the consequent parameters of the fuzzy rules are updated by gradient descent. Applied to two classical reinforcement learning problems, DFP-OPTD not only produces a continuous action policy but also shows good convergence performance in the experiments.
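The page reproduces only the abstract, not the paper's equations, so the following Python sketch is only an illustration of the structure described above: normalized memberships from a first fuzzy layer over the state space and a second layer over the action space are combined into a Q-function that is linear in its consequent parameters, which are then updated by a semi-gradient on-policy TD rule. The class name DFPOnPolicyTD, the Gaussian membership functions, and the per-rule epsilon-greedy defuzzification are all hypothetical choices, not details taken from the paper.

```python
import numpy as np

class DFPOnPolicyTD:
    """Sketch of an on-policy TD learner with two fuzzy layers:
    layer 1 partitions the state space, layer 2 the action space.
    All design details here are assumptions, not the authors' method."""

    def __init__(self, state_centers, action_centers,
                 state_width=1.0, action_width=1.0,
                 alpha=0.05, gamma=0.95, epsilon=0.1):
        self.sc = np.asarray(state_centers, dtype=float)   # (n_rules, state_dim)
        self.ac = np.asarray(action_centers, dtype=float)  # (m,) scalar prototypes
        self.sw, self.aw = state_width, action_width
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Consequent parameters theta[i, j]: one weight per
        # (state rule i, action rule j) pair.
        self.theta = np.zeros((len(self.sc), len(self.ac)))

    def _phi(self, s):
        # Layer 1: normalized Gaussian memberships over the state rules.
        d2 = ((self.sc - np.asarray(s, dtype=float)) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2.0 * self.sw ** 2))
        return w / w.sum()

    def _mu(self, a):
        # Layer 2: normalized Gaussian memberships over the action rules.
        w = np.exp(-((self.ac - a) ** 2) / (2.0 * self.aw ** 2))
        return w / w.sum()

    def q(self, s, a):
        # Q(s, a) = phi(s)^T theta mu(a): both fuzzy layers combined.
        return self._phi(s) @ self.theta @ self._mu(a)

    def act(self, s):
        # Continuous action: each state rule picks an action prototype
        # (epsilon-greedy over its row of theta); the picks are blended
        # by the rules' firing strengths, yielding a continuous output.
        phi = self._phi(s)
        picks = np.empty(len(self.sc))
        for i in range(len(self.sc)):
            if np.random.rand() < self.epsilon:
                picks[i] = np.random.choice(self.ac)
            else:
                picks[i] = self.ac[np.argmax(self.theta[i])]
        return float(phi @ picks)

    def update(self, s, a, r, s_next, a_next, done):
        # On-policy (SARSA-style) TD error.
        target = r if done else r + self.gamma * self.q(s_next, a_next)
        delta = target - self.q(s, a)
        # Q is linear in theta, so dQ/dtheta[i, j] = phi_i(s) * mu_j(a);
        # semi-gradient descent on the squared TD error gives:
        self.theta += self.alpha * delta * np.outer(self._phi(s), self._mu(a))
```

As a purely illustrative usage (the abstract does not name its two benchmark problems): on a mountain-car-style task, state_centers could be a coarse grid over (position, velocity) and action_centers a handful of throttle prototypes; blending the per-rule greedy prototypes by firing strength is what makes the resulting policy continuous rather than restricted to the discrete prototypes.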
