Described is a computer-implemented device (1200) and method (1000) for processing a multi-agent system input to form an at least partially optimised output indicative of an action policy. The method (1000) comprises receiving (1001) the multi-agent system input, the multi-agent system input comprising a definition of a multi-agent system and defining behaviour patterns of a plurality of agents based on system states; receiving (1002) an indication of an input system state; performing (1003) an iterative machine learning process to estimate a single aggregate function representing the behaviour patterns of the plurality of agents over a set of system states; and iteratively processing (1004) the single aggregate function for the input system state to estimate an at least partially optimised set of actions for each of the plurality of agents in the input system state. This may allow policies corresponding to a Nash equilibrium to be learned.
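The abstract does not disclose the form of the single aggregate function or the learning process used to estimate it. As a rough illustration of step (1004) only, the following Python sketch assumes the aggregate function behaves like a potential function of a simple congestion game, so that iteratively improving it one agent at a time ends at a Nash equilibrium. The names `potential`, `agent_cost`, `best_response_iteration` and `is_nash`, the cost model, and the example state are all hypothetical; the learning step (1003) is replaced by a hard-coded function.

```python
# Assumption: the "system state" is a tuple of base costs, one per shared resource,
# and each agent's action is the index of the resource it uses.
# An agent's cost is the base cost of its resource plus the number of agents on it.

def potential(state, joint_action):
    """Hypothetical single aggregate function (Rosenthal-style potential)."""
    loads = [joint_action.count(r) for r in range(len(state))]
    # Sum over resources of (base + 1) + (base + 2) + ... + (base + load).
    return sum(n * base + n * (n + 1) // 2 for base, n in zip(state, loads))

def agent_cost(state, joint_action, agent):
    """Cost actually experienced by one agent under the assumed cost model."""
    r = joint_action[agent]
    return state[r] + joint_action.count(r)

def best_response_iteration(state, n_agents, max_sweeps=100):
    """Lower the aggregate function one agent at a time until no agent can improve it."""
    joint = [0] * n_agents  # every agent starts on resource 0
    for _ in range(max_sweeps):
        changed = False
        for i in range(n_agents):
            def value(r):
                # Aggregate value if agent i alone switches to resource r.
                return potential(state, joint[:i] + [r] + joint[i + 1:])
            best = min(range(len(state)), key=value)
            if value(best) < value(joint[i]):
                joint[i] = best
                changed = True
        if not changed:
            break
    return joint

def is_nash(state, joint):
    """True if no agent can reduce its own cost by deviating alone."""
    for i in range(len(joint)):
        current = agent_cost(state, joint, i)
        for r in range(len(state)):
            trial = joint[:i] + [r] + joint[i + 1:]
            if agent_cost(state, trial, i) < current:
                return False
    return True

state = (0, 1, 3)  # hypothetical input system state
policy = best_response_iteration(state, n_agents=4)
print(policy, is_nash(state, policy))
```

In this toy model a unilateral change in an agent's own cost equals the change in `potential`, which is why a point where no single-agent move lowers the aggregate function is also a point where no agent gains by deviating. Whether the claimed method relies on this property is not stated in the abstract.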