ACM Transactions on Multimedia Computing, Communications, and Applications

Modality-Invariant Image-Text Embedding for Image-Sentence Matching


Abstract

Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of text, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, a multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints via ground-truth labels, but also encourages the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvements over the baseline model and that our results compare favorably to state-of-the-art methods.
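To make the two training signals described above concrete, the following PyTorch sketch illustrates how a bidirectional triplet ranking loss over a joint embedding can be combined with an adversarial modality-classification loss. This is an illustration written from the abstract alone, not the authors' released code; the embedding dimension, margin value, network sizes, and all module and function names are assumptions.

```python
# Illustrative sketch (assumptions, not the authors' implementation) of the two
# losses described in the abstract: a triplet ranking loss on image-text
# similarity, plus an adversarial loss from a modality classifier that the
# encoders try to fool, encouraging modality-invariant embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 1024  # assumed size of the joint embedding space


class ModalityClassifier(nn.Module):
    """Discriminator that predicts whether an embedding came from the image
    encoder or the text encoder."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 2, 2),  # class 0 = image, class 1 = text
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)


def triplet_ranking_loss(img: torch.Tensor, txt: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Bidirectional hinge-based triplet loss over in-batch negatives,
    corresponding to the triplet-loss-based baseline."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    scores = img @ txt.t()                                # cosine similarities
    pos = scores.diag().unsqueeze(1)                      # matched pairs
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> wrong text
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # text -> wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0.0)
    cost_img = cost_img.masked_fill(mask, 0.0)
    return cost_txt.mean() + cost_img.mean()


def adversarial_losses(classifier: ModalityClassifier,
                       img: torch.Tensor, txt: torch.Tensor):
    """Returns (discriminator loss, encoder loss). In practice the two are
    optimized in alternating steps: the classifier on detached embeddings,
    the encoders with the classifier frozen."""
    emb = torch.cat([img, txt], dim=0)
    labels = torch.cat([torch.zeros(img.size(0)),
                        torch.ones(txt.size(0))]).long().to(emb.device)
    logits = classifier(emb)
    d_loss = F.cross_entropy(logits, labels)       # tell modalities apart
    g_loss = F.cross_entropy(logits, 1 - labels)   # encoders fool the classifier
    return d_loss, g_loss
```

A multi-stage schedule of the kind mentioned in the abstract would, for example, first train the encoders with `triplet_ranking_loss` alone and then alternate updates that add the adversarial terms, so that similarity constraints and distribution alignment are imposed jointly.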
