An image caption generation model with an adaptive attention mechanism is proposed to address the weakness of image description models that rely only on local image features. Under an encoder-decoder framework, local and global image features are extracted at the encoder using the Inception V3 and VGG19 network models, respectively. Because the proposed adaptive attention mechanism automatically identifies and weighs the importance of local and global image information, the decoder can generate sentences that describe the image more intuitively and accurately. The proposed model is trained and tested on the Microsoft COCO dataset. Experimental results show that, compared with an image caption model based on local features alone, the proposed method extracts richer and more complete information from the image and generates more accurate sentences.
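The abstract does not give the exact fusion equations, but a common form of adaptive attention computes spatial attention over local region features and then uses a learned scalar gate to blend the attended local context with the global feature. The sketch below is a minimal NumPy illustration under those assumptions; all weight names (`Wl`, `Wh`, `w_a`, `w_g`) and dimensions are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_attention(local_feats, global_feat, hidden, params):
    """Fuse local region features and a global feature with a learned gate.

    local_feats: (R, D) region features (e.g. from Inception V3)
    global_feat: (D,)   global feature (e.g. from VGG19, projected to D)
    hidden:      (H,)   decoder hidden state at the current time step
    """
    Wl, Wh = params["Wl"], params["Wh"]
    w_a, w_g = params["w_a"], params["w_g"]
    # spatial attention over local regions, conditioned on the hidden state
    scores = np.tanh(local_feats @ Wl + hidden @ Wh) @ w_a   # (R,)
    alpha = softmax(scores)                                  # attention weights
    local_ctx = alpha @ local_feats                          # (D,) attended local context
    # scalar gate: how much to rely on global vs. local information this step
    beta = sigmoid(hidden @ w_g)
    context = beta * global_feat + (1.0 - beta) * local_ctx  # (D,) fused context
    return context, alpha, beta

# toy dimensions: 49 regions (7x7 grid), 256-d features, 512-d hidden state
R, D, H = 49, 256, 512
params = {
    "Wl": rng.normal(size=(D, 64)),
    "Wh": rng.normal(size=(H, 64)),
    "w_a": rng.normal(size=64),
    "w_g": rng.normal(size=H),
}
ctx, alpha, beta = adaptive_attention(
    rng.normal(size=(R, D)), rng.normal(size=D), rng.normal(size=H), params
)
```

The fused `context` vector would then feed the decoder when predicting the next word; the gate `beta` is what lets the model decide per time step whether global or local evidence matters more.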