International Conference on Quantitative Evaluation of Systems

Policy Learning for Time-Bounded Reachability in Continuous-Time Markov Decision Processes via Doubly-Stochastic Gradient Ascent

Abstract

Continuous-time Markov decision processes are an important class of models for applications ranging from cyber-physical systems to synthetic biology. A central problem is how to devise a policy to control the system so as to maximise the probability of satisfying a set of temporal logic specifications. Here we present a novel approach based on statistical model checking and an unbiased estimate of the functional gradient in the space of possible policies. The statistical approach has several advantages over conventional approaches based on uniformisation: it also applies when the model is available only as a black box, and it does not suffer from state-space explosion. Using a stochastic gradient to guide the search considerably improves the efficiency of policy learning. We demonstrate the method on a proof-of-principle non-linear population model, showing strong performance in a non-trivial task.
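
To make the setting concrete, below is a minimal Python sketch of simulation-based policy-gradient ascent for time-bounded reachability in a CTMDP. It pairs statistical model checking (Monte Carlo estimation of the reachability probability) with a plain score-function (REINFORCE-style) gradient estimator for a softmax policy. The toy three-state model, the parameterisation, and all names (rates, P, rollout) are illustrative assumptions, not the paper's benchmark; the score-function estimator here is a simple stand-in for the paper's doubly-stochastic estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state CTMDP (illustrative; not the paper's model).
# rates[s, a] : exit rate of the exponential sojourn in state s under action a
# P[s, a, s'] : jump probabilities after leaving s under action a
n_states, n_actions, goal, horizon = 3, 2, 2, 5.0
rates = np.array([[1.0, 2.0],
                  [1.5, 0.5],
                  [1.0, 1.0]])
P = np.array([[[0.0, 0.8, 0.2], [0.0, 0.2, 0.8]],
              [[0.5, 0.0, 0.5], [0.9, 0.0, 0.1]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])  # goal state is absorbing

def softmax_policy(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def rollout(theta):
    """Simulate one trajectory up to the time bound. Returns the reachability
    indicator and the score sum_t grad_theta log pi(a_t | s_t)."""
    s, t, score = 0, 0.0, np.zeros_like(theta)
    while s != goal:
        probs = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=probs)
        score[s] -= probs            # grad of log-softmax: e_a - pi(.|s)
        score[s, a] += 1.0
        t += rng.exponential(1.0 / rates[s, a])  # exponential sojourn time
        if t > horizon:
            return 0.0, score        # time bound hit before reaching the goal
        s = rng.choice(n_states, p=P[s, a])
    return 1.0, score

theta = np.zeros((n_states, n_actions))  # policy parameters
lr, batch = 0.5, 200
for it in range(101):
    samples = [rollout(theta) for _ in range(batch)]
    reach = np.mean([r for r, _ in samples])             # SMC estimate of the objective
    grad = np.mean([r * g for r, g in samples], axis=0)  # unbiased REINFORCE gradient
    theta += lr * grad
    if it % 20 == 0:
        print(f"iter {it:3d}  estimated P(reach goal by T={horizon}) = {reach:.3f}")
```

The estimator only requires the ability to simulate trajectories and check whether the goal is reached within the time bound, which is why the statistical approach also works on black-box models; the overall loop (simulate trajectories, estimate the objective and its gradient, take an ascent step) reflects the general approach described in the abstract.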