Mathematical Problems in Engineering: Theory, Methods and Applications

Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization



Abstract

Language and vision are the two most essential components of human intelligence for interpreting the real world around us, and how to connect language and vision is a key question in current research. Multimodal methods such as visual semantic embedding, which unify images and their corresponding texts in a shared feature space, have been widely studied recently. Inspired by recent developments in text data augmentation, in particular a simple but powerful technique called EDA (easy data augmentation), we can use EDA to expand the information in the given data and improve model performance. In this paper, we exploit text data augmentation and word embedding initialization for multimodal retrieval: we apply EDA to augment the text data, use word embedding initialization for the text encoder based on recurrent neural networks, and minimize the gap between the two feature spaces with a triplet ranking loss using hard negative mining. On two Flickr-based datasets, we achieve the same recall with only 60% of the training data as normal training with the full dataset. Experimental results show the improvement of the proposed model: on all datasets in this paper (Flickr8k, Flickr30k, and MS-COCO), our model performs better on image annotation and image retrieval tasks. The experiments also show that text data augmentation is more suitable for smaller datasets, while word embedding initialization is more suitable for larger ones.
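EDA, as originally described by Wei and Zou, consists of four token-level operations: synonym replacement, random insertion, random swap, and random deletion. A minimal sketch in Python is shown below; the toy synonym table is a placeholder assumption (the published EDA draws synonyms from WordNet), and function names are illustrative rather than taken from the paper.

```python
import random

def synonym_replacement(tokens, synonyms, n=1):
    """Replace up to n tokens that have an entry in the synonym table."""
    tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    random.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = random.choice(synonyms[tokens[i]])
    return tokens

def random_insertion(tokens, synonyms, n=1):
    """Insert a synonym of a random token at a random position, n times."""
    tokens = tokens[:]
    for _ in range(n):
        candidates = [t for t in tokens if t in synonyms]
        if not candidates:
            break
        word = random.choice(synonyms[random.choice(candidates)])
        tokens.insert(random.randrange(len(tokens) + 1), word)
    return tokens

def random_swap(tokens, n=1):
    """Swap two randomly chosen token positions, n times."""
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token with probability p; always keep at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]
```

Each augmented caption is a cheap, label-preserving paraphrase of the original, which is why EDA pays off most on small datasets where captions are scarce.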
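The triplet ranking loss with hard negative mining mentioned in the abstract can be sketched as the max-of-hinges objective popularized by VSE++; the NumPy version below is an illustrative assumption of that formulation (the paper's exact loss may differ in details), with image and caption embeddings assumed L2-normalized so dot products equal cosine similarities.

```python
import numpy as np

def hard_negative_triplet_loss(img, cap, margin=0.2):
    """Triplet ranking loss using only the hardest in-batch negatives.

    img, cap: (batch, dim) L2-normalized embeddings, where img[i]
    matches cap[i]. For each positive pair, only the single most
    similar non-matching item in the batch contributes to the loss.
    """
    scores = img @ cap.T                    # pairwise cosine similarities
    pos = np.diag(scores)                   # matching-pair scores
    # Mask the diagonal so positives are never selected as negatives.
    mask = np.eye(len(scores), dtype=bool)
    neg = np.where(mask, -np.inf, scores)
    hardest_cap = neg.max(axis=1)           # hardest caption per image
    hardest_img = neg.max(axis=0)           # hardest image per caption
    cost_cap = np.maximum(0.0, margin + hardest_cap - pos)
    cost_img = np.maximum(0.0, margin + hardest_img - pos)
    return (cost_cap + cost_img).mean()
```

Focusing the hinge on the hardest negative, rather than summing over all negatives, is what makes this loss effective for cross-modal retrieval benchmarks such as the recall metrics reported above.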
