Deep reinforcement learning of cooperative neural networks can be performed by obtaining an action and observation sequence including a plurality of time frames, each time frame including action values and observation values. At least some of the observation values of each time frame of the action and observation sequence can be input sequentially into a first neural network including a plurality of first parameters. The action values of each time frame of the action and observation sequence and output values from the first neural network corresponding to the at least some of the observation values of each time frame of the action and observation sequence can be input sequentially into a second neural network including a plurality of second parameters. An action-value function can be approximated using the second neural network, and the plurality of first parameters of the first neural network can be updated using backpropagation.
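The data flow described above can be illustrated with a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the claimed implementation: the layer sizes, the tanh and linear layers, the squared-error loss, the learning rate, and the per-frame `target` value (standing in for whatever action-value target a reinforcement learning method would supply) are all hypothetical, as are the names `first_net`, `second_net`, and `train_pass`. Observation values pass through the first network, its outputs are concatenated with the action values and fed to the second network, and the error is backpropagated through the second network to update the first parameters.

```python
import math
import random

def first_net(W1, obs):
    """First network (assumed: one tanh layer) applied to observation values."""
    return [math.tanh(sum(w * o for w, o in zip(row, obs))) for row in W1]

def second_net(w2, action, hidden):
    """Second network (assumed: linear) over action values plus first-network outputs."""
    x = list(action) + list(hidden)
    return sum(w * xi for w, xi in zip(w2, x))

def train_pass(W1, w2, sequence, lr=0.05):
    """Process the time frames sequentially; return the mean squared error."""
    total = 0.0
    for action, obs, target in sequence:
        hidden = first_net(W1, obs)
        q = second_net(w2, action, hidden)      # approximated action value
        err = q - target
        total += err * err
        n_a = len(action)
        # Backpropagate the error through the second network to its hidden inputs
        # (uses the second parameters as they were before this frame's update).
        grad_hidden = [err * w2[n_a + j] for j in range(len(hidden))]
        # Update the second parameters.
        x = list(action) + hidden
        for i in range(len(w2)):
            w2[i] -= lr * err * x[i]
        # Update the first parameters through the tanh nonlinearity.
        for j, row in enumerate(W1):
            g = grad_hidden[j] * (1.0 - hidden[j] ** 2)
            for k in range(len(row)):
                row[k] -= lr * g * obs[k]
    return total / len(sequence)

# Hypothetical toy sequence: each time frame holds action values,
# observation values, and an assumed target action value.
random.seed(0)
sequence = []
for _ in range(16):
    action = [random.choice([0.0, 1.0])]
    obs = [random.uniform(-1, 1), random.uniform(-1, 1)]
    target = 2.0 * action[0] + math.tanh(obs[0] - obs[1])
    sequence.append((action, obs, target))

W1 = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(3)]  # first parameters
w2 = [random.uniform(-0.1, 0.1) for _ in range(1 + 3)]                  # second parameters

first_loss = train_pass(W1, w2, sequence)
for _ in range(300):
    last_loss = train_pass(W1, w2, sequence)
```

In this sketch the first network never sees the action values; it receives a learning signal only through the gradient passed back from the second network, which is the sense in which the two networks are trained cooperatively here.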