IEEE Transactions on Image Processing

Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation


Abstract

In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, a task known as image captioning. However, collecting a sufficient number of paired images and sentences in each domain is labour-intensive and time-consuming. It may therefore be beneficial to transfer an image captioning model trained on an existing domain with paired images and sentences (the source domain) to a new domain with only unpaired data (the target domain). In this paper, we propose a cross-modal retrieval-aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo image-sentence pairs in the target domain, which in turn facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process: a cross-modal retrieval model is first pre-trained on the source domain data and then applied to the target domain data to acquire an initial set of pseudo image-sentence pairs. These pseudo pairs are further refined by alternately fine-tuning the retrieval model on the current pseudo pairs and regenerating the pseudo pairs with the updated retrieval model. To adapt the linguistic patterns learned in the source domain to the target domain, we propose an adaptive image captioning model with a self-attention mechanism, fine-tuned on the refined pseudo image-sentence pairs. Experimental results in several settings, with MSCOCO as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) as the target domains, demonstrate that our method performs better than or comparably to state-of-the-art methods in most settings. We also extend our method to cross-domain video captioning, with MSR-VTT as the source domain and two other datasets (MSVD and Charades Captions) as the target domains, further demonstrating the effectiveness of our method.
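
The iterative retrieval process described in the abstract lends itself to a short sketch. The following Python is a minimal toy illustration of the alternating loop (build pseudo pairs by nearest-neighbour retrieval in a shared embedding space, then update the retrieval model and regenerate the pairs). All helper names (embed_images, embed_sentences, build_pseudo_pairs) and the random data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two encoders of a cross-modal retrieval
# model pre-trained on the source domain: both map into a shared space.
def embed_images(image_feats, W):
    return image_feats @ W        # (n_images, d)

def embed_sentences(sent_feats, V):
    return sent_feats @ V         # (n_sentences, d)

def build_pseudo_pairs(img_emb, snt_emb):
    """Match each target image to its most similar sentence (cosine)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    snt = snt_emb / np.linalg.norm(snt_emb, axis=1, keepdims=True)
    sim = img @ snt.T                     # (n_images, n_sentences)
    return np.argmax(sim, axis=1)         # pseudo sentence index per image

# Toy unpaired target-domain features (no ground-truth pairing).
image_feats = rng.normal(size=(8, 16))
sent_feats = rng.normal(size=(20, 32))
W, V = rng.normal(size=(16, 4)), rng.normal(size=(32, 4))

for it in range(3):
    pairs = build_pseudo_pairs(embed_images(image_feats, W),
                               embed_sentences(sent_feats, V))
    # In the actual method the retrieval model is fine-tuned on the current
    # pseudo pairs; here a small perturbation only marks where that update
    # would happen before the pairs are regenerated.
    W += 0.01 * rng.normal(size=W.shape)
    V += 0.01 * rng.normal(size=V.shape)
    print(f"iteration {it}: pseudo pairs = {pairs.tolist()}")

# The refined pairs would then be used to fine-tune the self-attention
# captioning model on the target domain.
```

The design point the sketch tries to capture is the mutual reinforcement between the two steps: a better retrieval model yields cleaner pseudo pairs, and cleaner pseudo pairs yield a better retrieval model for the next round.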