Journal: Multimedia Tools and Applications

Image captions: global-local and joint signals attention model (GL-JSAM)

Abstract

For automated visual captioning, existing neural encoder-decoder methods commonly use a simple sequence-to-sequence or an attention-based mechanism. Attention-based models attend to specific visual areas or objects, using a single heat map that indicates which portion of the image is most important rather than treating all objects within the image equally. These models are usually a mixture of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures. CNNs generally extract global visual signals that provide only global information about the main objects, their attributes, and their relationships, but fail to provide local (regional) information within objects, such as lines, corners, curves, and edges. On one hand, missing some of the information and detail carried by local visual signals may lead to misprediction, misidentification of objects, or completely missing the main object(s). On the other hand, superfluous visual signal information, which may come from objects in the foreground or background, produces meaningless and irrelevant descriptions. To address these concerns, we created a new joint signals attention image captioning model for global and local signals that is adaptive by nature. First, the proposed model extracts global visual signals at the image level and local visual signals at the object level. The joint signal attention model (JSAM) plays a dual role in visual signal extraction and non-visual signal prediction. Initially, JSAM selects meaningful global and regional visual signals, discards irrelevant visual signals, and integrates the selected visual signals intelligently. Subsequently, in the language model, JSAM decides at each time step how to attend to visual or non-visual signals in order to generate accurate, descriptive, and elegant sentences. Lastly, we examine the efficiency and superiority of the proposed model over recent comparable image captioning models by conducting experiments on the MS-COCO dataset.
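
The joint attention described in the abstract (global image-level signals, object-level regional signals, and a non-visual signal the decoder can fall back on at each time step) can be sketched roughly as follows. This is only an interpretive sketch in PyTorch, not the authors' implementation: the dimension sizes, the sentinel-style non-visual signal, and the single softmax over all candidate signals are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSignalAttention(nn.Module):
    """One attention step over global (image-level), local (object-level)
    and non-visual (sentinel-style) signals, queried by the decoder state."""

    def __init__(self, d_img=2048, d_obj=2048, d_hid=512, d_att=512):
        super().__init__()
        self.proj_global = nn.Linear(d_img, d_att)    # image-level (global) signal
        self.proj_local = nn.Linear(d_obj, d_att)     # object-level (local) signals
        self.proj_sentinel = nn.Linear(d_hid, d_att)  # non-visual (language) signal
        self.proj_hidden = nn.Linear(d_hid, d_att)    # decoder hidden state as query
        self.score = nn.Linear(d_att, 1)

    def forward(self, global_feat, local_feats, hidden, sentinel):
        # global_feat: (B, d_img); local_feats: (B, N, d_obj)
        # hidden, sentinel: (B, d_hid)
        g = self.proj_global(global_feat).unsqueeze(1)   # (B, 1, d_att)
        loc = self.proj_local(local_feats)               # (B, N, d_att)
        s = self.proj_sentinel(sentinel).unsqueeze(1)    # (B, 1, d_att)
        q = self.proj_hidden(hidden).unsqueeze(1)        # (B, 1, d_att)

        # Candidate signals: non-visual sentinel + global image + local regions.
        cand = torch.cat([s, g, loc], dim=1)                    # (B, 2 + N, d_att)
        scores = self.score(torch.tanh(cand + q)).squeeze(-1)   # (B, 2 + N)
        alpha = F.softmax(scores, dim=-1)                       # joint attention weights

        context = (alpha.unsqueeze(-1) * cand).sum(dim=1)       # (B, d_att)
        beta = alpha[:, 0]  # weight placed on the non-visual signal at this step
        return context, alpha, beta


# Example: 36 detected object regions, batch of 2 (shapes are illustrative).
att = JointSignalAttention()
ctx, alpha, beta = att(torch.randn(2, 2048), torch.randn(2, 36, 2048),
                       torch.randn(2, 512), torch.randn(2, 512))
```

In this reading, the language model would feed `context` into the word predictor at each time step, and `beta` indicates how strongly the step relied on the non-visual signal rather than the global or regional visual signals.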