Methods, systems, and apparatus, including computer programs encoded on computer storage media, of selecting actions from a set of actions to be performed in an environment. One of the methods includes, at each time step: maintaining count data; determining, for each action, a respective current transition probability distribution that includes a respective current transition probability for each of the intermediate signals that represents an estimate of a current likelihood that the intermediate signal will be observed if the action is performed; determining, for each intermediate signal, a respective reward estimate that is an estimate of a reward that will be received as a result of the intermediate signal being observed; determining, from the respective current transition probability distributions and the respective reward estimates, a respective action score for each action; and selecting an action to be performed based on the respective action scores.
展开▼