IEEE Transactions on Image Processing

MAVA: Multi-Level Adaptive Visual-Textual Alignment by Cross-Media Bi-Attention Mechanism

Abstract

Rapidly developing information technology has led to fast growth of visual and textual content, which brings huge challenges for correlating images with sentences and performing cross-media retrieval between them. Existing methods mainly explore cross-media correlation either from global-level instances, i.e., whole images and sentences, or from local-level fine-grained patches, i.e., discriminative image regions and key words, but they ignore the complementary information carried by the relations between local-level fine-grained patches. Naturally, relation understanding is highly important for learning cross-media correlation: people attend not only to the alignment between discriminative image regions and key words, but also to their relations within the visual and textual context. Therefore, in this paper, we propose the Multi-level Adaptive Visual-textual Alignment (MAVA) approach with the following contributions. First, we propose a cross-media multi-pathway fine-grained network to extract not only local fine-grained patches, i.e., discriminative image regions and key words, but also visual relations between image regions as well as textual relations from the context of sentences, which contain complementary information for exploiting fine-grained characteristics within different media types. Second, we propose a visual-textual bi-attention mechanism to distinguish fine-grained information of different saliency at both the local and relation levels, which provides more discriminative hints for correlation learning. Third, we propose cross-media multi-level adaptive alignment to explore global, local, and relation alignments. An adaptive alignment strategy is further proposed to enhance the matched pairs across media types and adaptively discard misalignments, so as to learn more precise cross-media correlation. Extensive experiments on image-sentence matching are conducted on two widely used cross-media datasets, namely Flickr-30K and MS-COCO, comparing with 10 state-of-the-art methods, which fully verifies the effectiveness of our proposed MAVA approach.
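To make the bi-attention idea concrete, below is a minimal sketch (in PyTorch) of one cross-media bi-attention step over region and word features. The function name `bi_attention`, the cosine-similarity affinity, and the softmax temperature are illustrative assumptions for this sketch, not details taken from the paper.

```python
# A minimal sketch of a cross-media bi-attention step, assuming image-region
# features V (m x d) and word features T (n x d) already live in a shared
# embedding space. All names and the temperature value are hypothetical.
import torch
import torch.nn.functional as F

def bi_attention(V: torch.Tensor, T: torch.Tensor, temperature: float = 9.0):
    """Attend words to regions and regions to words from one affinity matrix.

    V: (m, d) region features; T: (n, d) word features.
    Returns a text-attended visual context (n, d) and a
    vision-attended textual context (m, d).
    """
    # Cosine-similarity affinity between every region and every word.
    A = F.normalize(V, dim=-1) @ F.normalize(T, dim=-1).t()  # (m, n)

    # For each word, a saliency-weighted sum of regions (visual context).
    attn_v = F.softmax(temperature * A.t(), dim=-1)          # (n, m)
    visual_context = attn_v @ V                              # (n, d)

    # For each region, a saliency-weighted sum of words (textual context).
    attn_t = F.softmax(temperature * A, dim=-1)              # (m, n)
    textual_context = attn_t @ T                             # (m, d)

    return visual_context, textual_context
```

Under this sketch, a matching score for an image-sentence pair can then be read off as, e.g., the cosine similarity between each word feature in T and its attended visual context; the same machinery applies at the relation level by feeding relation features in place of V and T.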
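The adaptive alignment strategy can likewise be sketched as score aggregation that keeps cross-media pairs above a data-dependent threshold and discards the rest as misalignments. The mean-based threshold below is an assumption for illustration only; the paper's actual selection rule may differ.

```python
# A minimal sketch of adaptive alignment: enhance matched region-word pairs
# and adaptively discard misalignments. The thresholding rule is hypothetical.
import torch

def adaptive_alignment_score(A: torch.Tensor) -> torch.Tensor:
    """A: (m, n) region-word similarity matrix for one image-sentence pair.

    Returns a scalar matching score over adaptively selected pairs.
    """
    # Treat each word's best-matching region as a candidate alignment.
    best_per_word, _ = A.max(dim=0)                  # (n,)

    # Data-dependent threshold: the mean candidate similarity (an assumption).
    threshold = best_per_word.mean()

    # Keep matched pairs above the threshold, discard the rest.
    matched = best_per_word[best_per_word >= threshold]
    if matched.numel() == 0:
        return best_per_word.mean()
    return matched.mean()
```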
