A neural network training system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training a value neural network, wherein the value neural network is configured to receive an observation characterizing a state of an environment being interacted with by an agent and to process the observation in accordance with parameters of the value neural network to generate a value score, the operations comprising:
training a supervised learning policy neural network, wherein the supervised learning policy neural network is configured to receive the observation and to process the observation in accordance with parameters of the supervised learning policy neural network to generate a respective action probability for each action in a set of possible actions that can be performed by the agent to interact with the environment, and wherein training the supervised learning policy neural network comprises training the supervised learning policy neural network on labeled training data using supervised learning to determine trained values of the parameters of the supervised learning policy neural network;
initializing initial values of parameters of a reinforcement learning policy neural network, which has the same architecture as the supervised learning policy neural network, to the trained values of the parameters of the supervised learning policy neural network;
training the reinforcement learning policy neural network on second training data generated from interactions of the agent with a simulated version of the environment using reinforcement learning to determine, from the initial values, trained values of the parameters of the reinforcement learning policy neural network; and
training the value neural network to generate a value score for the state of the environment that represents a predicted long-term reward resulting from the environment being in the state, by training the value neural network on third training data generated from interactions of the agent with the simulated version of the environment using supervised learning to determine, from initial values of the parameters of the value neural network, trained values of the parameters of the value neural network.
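The three training stages recited in the claim can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the claim: the five-state chain environment, the tabular softmax "networks", the REINFORCE-style update, the learning rates, the hypothetical expert labels, and the choice to generate the value network's training data with the reinforcement learning policy.

```python
import copy
import math
import random

N_STATES, ACTIONS = 5, (0, 1)  # toy chain (assumed): action 0 = left, 1 = right

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_probs(params, state):
    """Policy 'network': one probability per possible action for this state."""
    return softmax(params[state])

def simulate(params, rng, max_steps=20):
    """Play one episode in the simulated environment.
    Returns the visited (state, action) pairs and the final outcome."""
    state, trajectory = 2, []
    for _ in range(max_steps):
        action = rng.choices(ACTIONS, weights=action_probs(params, state))[0]
        trajectory.append((state, action))
        state += 1 if action == 1 else -1
        if state == N_STATES - 1:
            return trajectory, 1.0   # reached the right end: reward +1
        if state == 0:
            return trajectory, -1.0  # reached the left end: reward -1
    return trajectory, 0.0           # step limit reached: reward 0

def log_likelihood_step(params, state, action, scale, lr):
    """Gradient-ascent step on scale * log p(action | state)."""
    probs = action_probs(params, state)
    for a in ACTIONS:
        params[state][a] += lr * scale * ((1.0 if a == action else 0.0) - probs[a])

def train_supervised_policy(labeled, epochs=50, lr=0.5):
    """Stage 1: supervised learning on labeled (state, action) pairs."""
    params = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(epochs):
        for state, action in labeled:
            log_likelihood_step(params, state, action, 1.0, lr)
    return params

def train_rl_policy(sl_params, rng, episodes=200, lr=0.1):
    """Stage 2: same architecture, initialized from the stage-1 values,
    then updated with a REINFORCE-style rule on simulated episodes."""
    params = copy.deepcopy(sl_params)
    for _ in range(episodes):
        trajectory, outcome = simulate(params, rng)
        for state, action in trajectory:
            log_likelihood_step(params, state, action, outcome, lr)
    return params

def train_value(policy_params, rng, episodes=500, lr=0.05):
    """Stage 3: regress a value score per state onto observed outcomes."""
    values = [0.0] * N_STATES  # initial parameter values of the value 'network'
    for _ in range(episodes):
        trajectory, outcome = simulate(policy_params, rng)
        for state, _ in trajectory:
            # SGD step on the squared error 0.5 * (value - outcome) ** 2
            values[state] += lr * (outcome - values[state])
    return values

rng = random.Random(0)
labeled = [(1, 1), (2, 1), (3, 1), (2, 1), (3, 0)]  # hypothetical expert data
sl = train_supervised_policy(labeled)
rl = train_rl_policy(sl, rng)
values = train_value(rl, rng)
print("value scores per state:", [round(v, 3) for v in values])
```

In this sketch the value scores are fitted by plain regression on (state, outcome) pairs, which corresponds to the supervised learning step of the final operation, and `copy.deepcopy` stands in for initializing the second policy network to the trained parameter values of the first.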