Many large MDPs can be represented compactly using a dynamic Bayesian network. Although the structure of the value function does not retain the structure of the process, recent work has suggested that value functions in factored MDPs can often be approximated well using a factored value function: a linear combination of restricted basis functions, each of which refers only to a small subset of variables. An approximate factored value function for a particular policy can be computed using approximate dynamic programming, but this approach (and others) can only produce an approximation relative to a distance metric weighted by the stationary distribution of the current policy. This type of weighted projection is ill-suited to policy improvement. We present a new approach to value determination that uses a simple closed-form computation to obtain a least-squares decomposed approximation to the value function directly, for any choice of weights. We then use this value determination algorithm as a subroutine in a policy iteration process. We show that, under reasonable restrictions, the policies induced by a factored value function can be compactly represented as a decision list and can be manipulated efficiently in a policy iteration process. We also present a method for computing error bounds for decomposed value functions using a variable-elimination algorithm for function optimization. The complexity of all of our algorithms depends on the factorization of the system dynamics and of the approximate value function.
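The idea of a factored value function described above can be illustrated concretely. The following is a minimal sketch, not the paper's algorithm: it builds a value function as a linear combination of two hypothetical restricted-scope basis functions over three binary state variables, and fits the weights by weighted least squares with an arbitrarily chosen (here, uniform) weighting, rather than one tied to a policy's stationary distribution. The basis functions, target values, and weights are all illustrative assumptions.

```python
import numpy as np

# State space: 3 binary variables; enumerate all 8 joint states.
states = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

# Two restricted-scope basis functions (scopes chosen arbitrarily):
# h1 depends only on variable 0, h2 only on variables 1 and 2.
def h1(s): return float(s[0])
def h2(s): return float(s[1] and s[2])
basis = [h1, h2]

# Feature matrix A: one row per state, one column per basis function.
A = np.array([[h(s) for h in basis] for s in states])

# A target value function to approximate (a stand-in for V^pi).
v = np.array([0.0, 1.0, 0.5, 2.0, 1.0, 2.0, 1.5, 3.5])

# Weighted least-squares projection: unlike a projection tied to the
# stationary distribution of one policy, the weights d here can be
# chosen freely (uniform in this sketch).
d = np.full(len(states), 1.0 / len(states))
sqrt_D = np.diag(np.sqrt(d))
w, *_ = np.linalg.lstsq(sqrt_D @ A, sqrt_D @ v, rcond=None)

# Factored approximation: V(s) ≈ w[0]*h1(s) + w[1]*h2(s).
v_hat = A @ w
```

The point of the factorization is that each basis function touches only a few variables, so in a real factored MDP the matrix `A` is never materialized over the full exponential state space; this toy version enumerates all states only for clarity.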