A two dimensional image can be received. A depth map can be produced, via a first neural network, from the two dimensional image. A bird's eye view image can be produced, via a second neural network, from the depth map. The second neural network can implement a machine learning algorithm that preserves spatial gradient information associated with one or more objects included in the depth map and causes a position of a pixel in an object, included in the bird's eye view image, to be represented by a differentiable function. Three dimensional objects can be detected, via a third neural network, from the two dimensional image, the bird's eye view image, and the spatial gradient information. A combination of the first neural network, the second neural network, and the third neural network can be end-to-end trainable and can be included in a perception system.
展开▼