IEEE Transactions on Image Processing

Image Captioning With End-to-End Attribute Detection and Subsequent Attributes Prediction


Abstract

Semantic attention has been shown to be effective in improving the performance of image captioning. The core of semantic attention based methods is to drive the model to attend to semantically important words, or attributes. In previous works, the attribute detector and the captioning network are usually independent, leading to insufficient usage of the semantic information. Also, all the detected attributes, no matter whether they are appropriate for the linguistic context at the current step, are attended to throughout the whole caption generation process. This may sometimes disrupt the captioning model and cause it to attend to incorrect visual concepts. To solve these problems, we introduce two end-to-end trainable modules that closely couple attribute detection with image captioning and promote the effective use of attributes by predicting appropriate attributes at each time step. The multimodal attribute detector (MAD) module improves attribute detection accuracy by using not only the image features but also the word embeddings of attributes that already exist in most captioning models. MAD models the similarity between the semantics of attributes and the image object features to facilitate accurate detection. The subsequent attribute predictor (SAP) module dynamically predicts a concise attribute subset at each time step to cope with the diversity of image attributes. Compared to previous attribute based methods, our approach enhances the explainability of how the attributes affect the generated words and achieves a state-of-the-art single model performance of 128.8 CIDEr-D on the MSCOCO dataset. Extensive experiments on the MSCOCO dataset show that our proposal improves performance in both image captioning and attribute detection simultaneously. The codes are available at: https://github.com/RubickH/Image-Captioning-with-MAD-and-SAP.
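The two ideas in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation (see their repository for that); it is a minimal, hypothetical NumPy toy in which MAD-style detection scores each attribute by its similarity to image object features, and SAP-style prediction re-weights the detected attributes at one decoding step using a made-up hidden state. All dimensions and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: a shared embedding dimension, a small
# attribute vocabulary, and a few image object/region features.
d, n_attrs, n_regions = 16, 5, 3

attr_emb = rng.standard_normal((n_attrs, d))    # word embeddings of attributes
obj_feats = rng.standard_normal((n_regions, d)) # image object features

# MAD-style detection: score each attribute by its maximum similarity
# to any image region, then squash to a detection probability.
sims = attr_emb @ obj_feats.T                   # (n_attrs, n_regions)
det_probs = 1.0 / (1.0 + np.exp(-sims.max(axis=1)))  # sigmoid

# SAP-style step-wise prediction: given the decoder hidden state at
# time t, re-weight the detected attributes so only a concise,
# context-relevant subset dominates the attention at this step.
h_t = rng.standard_normal(d)                    # hypothetical hidden state
step_weights = softmax(attr_emb @ h_t) * det_probs
step_weights /= step_weights.sum()              # renormalize to a distribution

# Keep only the top-scoring attributes as the concise subset.
subset = np.argsort(step_weights)[::-1][:2]
print("attended attribute indices at step t:", subset)
```

The key design point the sketch mirrors is that detection (`det_probs`) depends only on the image, while the per-step weighting also depends on the decoder state, so the attended subset changes as the caption unfolds.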
