Various implementations of the subject matter relate to moment localization in a media stream. In some implementations, a two-dimensional temporal feature map representing a plurality of moments within a media stream is extracted from the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments. A correlation between the plurality of moments and an action in the media stream is determined based on the two-dimensional temporal feature map.
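
As an illustration only, the following minimal sketch shows one way such a two-dimensional temporal feature map could be laid out and scored. Everything in it is an assumption made for the example and is not taken from the described implementations: the media stream is assumed to be pre-split into clips with one feature vector each, a moment's feature is assumed to be the mean of its clip features, and the correlation with an action is assumed to be the cosine similarity to a hypothetical action vector. The function names `build_2d_temporal_map`, `score_moments`, and `best_moment` are likewise hypothetical.

```python
from math import sqrt


def build_2d_temporal_map(clip_features):
    """Build an N x N grid indexed as grid[start][end].

    Each valid cell (start <= end) holds the mean of the clip features
    from clip `start` through clip `end` inclusive (mean pooling is an
    assumption of this sketch). Cells with start > end are None because
    a moment cannot end before it starts.
    """
    n = len(clip_features)
    dim = len(clip_features[0])
    grid = [[None] * n for _ in range(n)]
    for start in range(n):
        running = [0.0] * dim  # running sum of features for this start index
        for end in range(start, n):
            running = [r + x for r, x in zip(running, clip_features[end])]
            length = end - start + 1
            grid[start][end] = [r / length for r in running]
    return grid


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_moments(grid, action_vec):
    """Score every valid moment against the (hypothetical) action vector."""
    return [
        [None if cell is None else cosine(cell, action_vec) for cell in row]
        for row in grid
    ]


def best_moment(scores):
    """Return (start, end, score) of the highest-scoring valid moment."""
    best = None
    for start, row in enumerate(scores):
        for end, value in enumerate(row):
            if value is not None and (best is None or value > best[2]):
                best = (start, end, value)
    return best


# Toy data: four clips with 2-d features; the action resembles clips 1 and 2.
clip_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
action_vec = [0.0, 1.0]

grid = build_2d_temporal_map(clip_features)
scores = score_moments(grid, action_vec)
start, end, score = best_moment(scores)
print(f"best moment: clips {start}..{end}, score {score:.3f}")
```

In this layout the first index is the start of a moment and the second index is its end, so only the upper triangle of the grid (including the diagonal) is populated. A learned model would typically replace both the pooling and the similarity function used here.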