IEEE Annual Conference on Decision and Control

Approximate Dynamic Programming with (min, +) linear function approximation for Markov decision processes



Abstract

The Markov Decision Process (MDP) is a useful framework for studying problems of optimal sequential decision making under uncertainty. Given any MDP, the aim here is to find the optimal action selection mechanism, i.e., the optimal policy. Typically, the optimal policy (u*) is obtained by substituting the optimal value function (J*) into the Bellman equation. Alternatively, u* can also be obtained by learning the optimal state-action value function Q*, known as the Q value function. However, it is difficult to compute the exact values of J* or Q* for MDPs with a large number of states. Approximate Dynamic Programming (ADP) methods address this difficulty by computing lower dimensional approximations of J*/Q*. Most ADP methods employ linear function approximation (LFA), i.e., the approximate solution lies in a subspace spanned by a family of pre-selected basis functions. The approximation is obtained via a linear least squares projection of higher dimensional quantities, and the L2 norm plays an important role in convergence and error analysis. In this paper, we discuss ADP methods for MDPs based on LFAs in the (min, +) algebra. Here the approximate solution is a (min, +) linear combination of a set of basis functions whose span constitutes a subsemimodule. The approximation is obtained via a projection operator onto the subsemimodule, which is different from the linear least squares projection used in ADP methods based on conventional LFAs. MDPs are not (min, +) linear systems; nevertheless, we show that the monotonicity property of the projection operator helps us establish the convergence of our ADP schemes. We also discuss future directions in ADP methods for MDPs based on (min, +) LFAs.
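The abstract gives no algorithmic detail, so the following is only a minimal Python sketch of the general idea under assumptions of our own: a small randomly generated cost-minimization MDP, a handful of arbitrary basis functions phi_i, the standard (min, +) projection with coefficients r_i = max_s ( J(s) - phi_i(s) ) onto the subsemimodule they span, and a projected value iteration J_{n+1} = Projection(T J_n). None of these specifics (problem data, basis, or exact iteration) are taken from the paper itself.

```python
import numpy as np

# --- Hypothetical problem data (not from the paper): a small random
# --- cost-minimization MDP with n_states states and n_actions actions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = transition prob.
c = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # c[s, a]    = one-stage cost

def bellman(J):
    """Bellman operator T: (T J)(s) = min_a [ c(s,a) + gamma * sum_s' P(s'|s,a) J(s') ]."""
    return np.min(c + gamma * P @ J, axis=1)

# --- (min, +) linear function approximation.
# A (min, +) "linear combination" of basis functions phi_1..phi_k with
# coefficients r_1..r_k is V_r(s) = min_i ( r_i + phi_i(s) ); the set of all
# such V_r is the subsemimodule spanned by the phi_i.
k = 5
Phi = rng.uniform(0.0, 5.0, size=(n_states, k))  # Phi[s, i] = phi_i(s), an arbitrary basis

def minplus_combination(r):
    """Evaluate V_r(s) = min_i ( r_i + phi_i(s) )."""
    return np.min(Phi + r[None, :], axis=1)

def minplus_projection(J):
    """Project J onto the subsemimodule: with r_i = max_s ( J(s) - phi_i(s) ),
    V_r is the smallest element of the span that dominates J pointwise.
    The map J -> V_r is monotone and sup-norm non-expansive."""
    r = np.max(J[:, None] - Phi, axis=0)
    return minplus_combination(r)

# --- Projected value iteration: J_{n+1} = Projection(T J_n).
J_approx = np.zeros(n_states)
for _ in range(300):
    J_approx = minplus_projection(bellman(J_approx))

# --- Exact value iteration for comparison.
J_star = np.zeros(n_states)
for _ in range(300):
    J_star = bellman(J_star)

print("sup-norm approximation error:", np.max(np.abs(J_approx - J_star)))
```

Because the projection above is monotone and sup-norm non-expansive while the Bellman operator is a gamma-contraction, their composition is still a contraction, so this sketch's iteration converges; how close its fixed point is to J* depends entirely on how well the chosen basis covers J*.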
