Journal: Multimedia Tools and Applications

Reference-based model using multimodal gated recurrent units for image captioning

Abstract

Describing images through natural language is a challenging task in the field of computer vision. Image captioning consists of creating image descriptions, which can be accomplished via deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, traditional RNNs encounter problems such as exploding and vanishing gradients, and they often generate non-descriptive sentences. To address these issues, we propose a model based on the encoder-decoder structure that uses CNNs to extract image features and multimodal gated recurrent units (GRUs) to generate descriptions. The model incorporates part-of-speech (PoS) information and a likelihood function for weight generation in the GRU. The method performs knowledge transfer during a validation phase that uses the k-nearest neighbors (kNN) technique. Experimental results on the Flickr30k and MSCOCO datasets demonstrate that the proposed PoS-based model achieves competitive scores compared with state-of-the-art models. The system predicts more descriptive captions and closely approximates the expected captions, both in the predicted and in the kNN-selected captions.
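To make the decoder concrete, the sketch below implements a plain GRU cell in numpy, the recurrent unit the abstract builds on. The gating equations shown here are the standard GRU formulation; the paper's multimodal, PoS-weighted variant (which injects image features and PoS-derived gate weights) is not reproduced, so treat this as an illustrative baseline, not the authors' model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal standard GRU cell (illustrative sketch only; the paper's
    multimodal, PoS-weighted GRU adds image features and PoS-based
    gate weights on top of these equations)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        w = (hidden_size, input_size)   # input-to-hidden weight shape
        u = (hidden_size, hidden_size)  # hidden-to-hidden weight shape
        self.Wz, self.Uz = rng.uniform(-s, s, w), rng.uniform(-s, s, u)
        self.Wr, self.Ur = rng.uniform(-s, s, w), rng.uniform(-s, s, u)
        self.Wh, self.Uh = rng.uniform(-s, s, w), rng.uniform(-s, s, u)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)              # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)              # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))  # candidate state
        return (1.0 - z) * h + z * h_tilde                  # gated interpolation

# Toy usage: unroll a few steps over constant inputs.
cell = GRUCell(input_size=8, hidden_size=4)
h = np.zeros(4)
for _ in range(3):
    h = cell.step(np.ones(8), h)
print(h.shape)  # (4,)
```

The update gate `z` interpolates between the previous state and the candidate state, which is what lets gradients flow through long sequences and mitigates the vanishing-gradient problem mentioned above.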
