A neural network system for localising facial landmarks in an image comprises two or more convolutional neural networks in series. Each convolutional neural network comprises a plurality of downsampling layers, a plurality of upsampling layers, and a plurality of lateral connections connecting layers of equal size. Each of the lateral connections comprises one or more aggregation nodes having an upsampling input, and at least one of the aggregation nodes further includes a downsampling input.
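
The topology described above can be illustrated with a minimal pure-Python sketch. It is not the claimed implementation: the names `downsample`, `upsample`, `aggregate`, `stage`, and `cascade` are hypothetical, feature maps are single-channel nested lists, and the learned convolutions are replaced by fixed averaging so that only the wiring is shown. It assumes an input whose height and width are divisible by 4, and it omits the step that would turn the final feature map into landmark coordinates.

```python
from typing import List, Optional

# Hypothetical stand-in for a tensor: one channel, rows of floats.
Grid = List[List[float]]


def downsample(x: Grid) -> Grid:
    """Halve height and width with 2x2 average pooling."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def upsample(x: Grid) -> Grid:
    """Double height and width by nearest-neighbour repetition."""
    return [
        [x[i // 2][j // 2] for j in range(2 * len(x[0]))]
        for i in range(2 * len(x))
    ]


def aggregate(lateral: Grid, up_in: Grid, down_in: Optional[Grid] = None) -> Grid:
    """Aggregation node on a lateral connection.

    Always merges the same-size lateral feature with an upsampling input
    (a coarser map brought up to this size). Optionally also merges a
    downsampling input (a finer map brought down to this size).
    Averaging is an assumption standing in for a learned merge.
    """
    inputs = [lateral, upsample(up_in)]
    if down_in is not None:
        inputs.append(downsample(down_in))
    n = len(inputs)
    return [
        [sum(g[i][j] for g in inputs) / n for j in range(len(lateral[0]))]
        for i in range(len(lateral))
    ]


def stage(x: Grid) -> Grid:
    """One network of the series: two downsampling and two upsampling steps."""
    e0 = x               # full resolution
    e1 = downsample(e0)  # 1/2 resolution
    e2 = downsample(e1)  # 1/4 resolution
    # Lateral connection at 1/2 resolution: this node has both an
    # upsampling input (from e2) and a downsampling input (from e0).
    a1 = aggregate(e1, up_in=e2, down_in=e0)
    # Lateral connection at full resolution: upsampling input only.
    a0 = aggregate(e0, up_in=a1)
    return a0


def cascade(x: Grid, num_stages: int = 2) -> Grid:
    """Run two or more stages in series, each feeding the next."""
    for _ in range(num_stages):
        x = stage(x)
    return x
```

Calling `cascade(image)` on an 8 by 8 grid returns an 8 by 8 grid, since each stage restores the resolution it reduced. In a trained system each averaging step would be replaced by learned layers, and the number of scales and aggregation nodes per lateral connection would be design choices.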