Mining richer visual features and analyzing the contextual information of an image for the decoding stage has become a challenging problem in image captioning. Some recent works employ external knowledge bases to obtain additional semantic relationships between objects by constructing a scene graph; however, pre-training the scene graph is time-consuming, and these manually defined relationships may not be comprehensive. In this paper, a novel hierarchical decoding method with latent context is proposed for image captioning, which analyzes visual context information and decodes multi-level visual features hierarchically to produce more accurate caption words. In the proposed method, a novel Latent Context Generation Network (LCGN) infers latent relationships between objects without any external knowledge and, at the same time, constructs a context vector containing rich neighbor information for each object. A graph convolutional network with attention then further aggregates the latent context information, combining object features with their context vectors to obtain high-level context features. Finally, hierarchical decoding based on a Triple Long Short-Term Memory (Tri-LSTM) network decodes global features, local features, and object features in turn, gradually analyzing the image content from the whole scene, to local regions, to individual objects. Experiments on the MSCOCO dataset show that the proposed method achieves highly competitive results in image captioning and outperforms most CNN-RNN architecture methods.
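To illustrate the latent-context idea described above, the following minimal Python sketch builds, for each object, a context vector as an attention-weighted sum of the other objects' features, using no external knowledge. This is an assumption-laden sketch, not the paper's implementation: the relation-scoring function (a plain dot product here, rather than the LCGN's learned scoring), the feature dimensions, and the names `latent_context` and `softmax` are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def latent_context(object_feats):
    """For each object, build a context vector as an attention-weighted
    sum of the other objects' features (no external knowledge base)."""
    contexts = []
    for i, f_i in enumerate(object_feats):
        neighbors = [f_j for j, f_j in enumerate(object_feats) if j != i]
        # Latent relation score: here simply the dot product f_i . f_j;
        # the paper's learned scoring function is not given in the abstract.
        weights = softmax([dot(f_i, f_j) for f_j in neighbors])
        # Weighted sum of neighbor features, dimension by dimension.
        ctx = [sum(w * f_j[d] for w, f_j in zip(weights, neighbors))
               for d in range(len(f_i))]
        contexts.append(ctx)
    return contexts

# Three toy 2-D object features; each context is a convex combination
# of the other two objects' features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctxs = latent_context(feats)
```

In the full model, these context vectors would be concatenated or fused with the object features and passed through the graph convolutional network with attention to yield the high-level context features used by the decoder.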