In the framework of Markov Decision Processes, we consider linear off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms---off-policy LSTD($\lambda$), LSPE($\lambda$), TD($\lambda$), TDC/GQ($\lambda$)---and suggests new extensions---off-policy FPKF($\lambda$), BRM($\lambda$), gBRM($\lambda$), GTD2($\lambda$). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms, on- and off-policy LSTD($\lambda$)/LSPE($\lambda$)---and TD($\lambda$) if the feature space dimension is too large for a least-squares approach---perform the best.
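For concreteness, below is a minimal sketch of one commonly used update of this kind: off-policy TD($\lambda$) with importance-sampling-corrected eligibility traces on linear features. The feature map, importance ratio function, step size and trajectory format are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

def off_policy_td_lambda(trajectory, phi, rho, theta0,
                         alpha=0.01, gamma=0.95, lam=0.9):
    """One pass of off-policy TD(lambda) with eligibility traces (sketch).

    trajectory: iterable of (s, a, r, s_next) generated by the behavior policy.
    phi:        feature map, s -> np.ndarray of shape (d,).
    rho:        importance ratio, rho(s, a) = pi(a|s) / mu(a|s) (assumed given).
    theta0:     initial weight vector of shape (d,).
    """
    theta = theta0.copy()
    z = np.zeros_like(theta)              # eligibility trace
    for s, a, r, s_next in trajectory:
        rho_t = rho(s, a)                 # importance-sampling correction
        # TD error under the current linear value estimate theta
        delta = r + gamma * theta @ phi(s_next) - theta @ phi(s)
        # decay the trace and accumulate the current feature vector,
        # reweighted by the behavior/target policy mismatch
        z = rho_t * (gamma * lam * z + phi(s))
        theta = theta + alpha * delta * z
    return theta
```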