Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making problem, e.g., a reinforcement learning (RL) or optimal control problem, and has served as a foundation for the development of RL methods. Motivated by integral PI (IPI) schemes in optimal control and by RL methods in continuous time and space (CTS), this paper proposes on-policy IPI to solve the general RL problem in CTS, with the environment modeled by an ordinary differential equation (ODE). In such a continuous domain, we also propose four off-policy IPI methods: two are the ideal PI forms that use advantage and Q-functions, respectively, and the other two are natural extensions of existing off-policy IPI schemes to our general RL framework. Compared to the IPI methods in optimal control, the proposed IPI schemes apply to more general situations and do not require an initial stabilizing policy to run; they are also closely related to RL algorithms in CTS such as advantage updating, Q-learning, and value-gradient-based (VGB) greedy policy improvement. Our on-policy IPI is basically model-based but can be made partially model-free; each off-policy method is either partially or completely model-free. The mathematical properties of the IPI methods (admissibility, monotone improvement, and convergence to the optimal solution) are all rigorously proven, together with the equivalence of on- and off-policy IPI. Finally, the IPI methods are simulated with an inverted-pendulum model to support the theory and verify their performance.
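For reference, the following is a minimal sketch of the classical IPI iteration from optimal control that motivates this work, assuming dynamics $\dot{x} = f(x,u)$ and a running cost $r(x,u)$; the symbols $f$, $r$, $T$ and the cost-minimization convention are generic illustrations, not the paper's exact general-RL formulation. Given an admissible policy $\pi_i$ and an interval length $T > 0$, policy evaluation solves
\[
  V_{\pi_i}(x(t)) = \int_{t}^{t+T} r\big(x(\tau), \pi_i(x(\tau))\big)\, d\tau + V_{\pi_i}(x(t+T)),
\]
and policy improvement then updates the policy greedily with respect to the evaluated value function,
\[
  \pi_{i+1}(x) = \operatorname*{arg\,min}_{u} \Big[\, r(x,u) + \nabla V_{\pi_i}(x)^{\top} f(x,u) \,\Big],
\]
after which the two steps are repeated until convergence.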