ACM Transactions on Multimedia Computing, Communications, and Applications

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval

Abstract

Cross-modal retrieval aims to retrieve data in one modality given a query in another modality, and has been an active research issue in the fields of multimedia, information retrieval, computer vision, and databases. Most existing work focuses on cross-modal retrieval between text and image, text and video, or lyrics and audio. Little research addresses cross-modal retrieval between audio and video, owing to the limited number of paired audio-video datasets and their limited semantic information. The main challenge of the audio-visual cross-modal retrieval task is learning joint embeddings in a shared subspace in which similarity can be computed across modalities, where the goal of generating new representations is to maximize the correlation between the audio and visual modality spaces. In this work, we propose TNN-C-CCA, a novel deep triplet neural network with cluster canonical correlation analysis: an end-to-end supervised learning architecture with an audio branch and a video branch. We consider not only the matching pairs in the common space but also the mismatching pairs when maximizing the correlation. In particular, two significant contributions are made. First, a better representation can be generated by constructing a deep triplet neural network with a triplet loss, whose learned projections maximize the correlation in the shared subspace. Second, both positive and negative examples are used during learning to improve the quality of the embeddings learned between audio and video. Our experiments are run with fivefold cross-validation, and average performance is reported to assess audio-video cross-modal retrieval. The experimental results on two different audio-visual datasets show that the proposed two-branch learning architecture outperforms six existing canonical correlation analysis-based methods and four state-of-the-art cross-modal retrieval methods.
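The following is a minimal sketch, not the authors' implementation, of the triplet-based embedding learning the abstract describes: an audio branch and a video branch project pre-extracted features into a shared subspace, and a triplet loss pulls matching audio-video pairs together while pushing mismatching pairs apart. The `Branch` module, the `triplet_step` helper, the layer sizes, and the feature dimensions are all illustrative assumptions, and the cluster-CCA component of TNN-C-CCA is not modeled here.

```python
# Minimal sketch (assumed, not from the paper): two-branch triplet
# embedding learning for audio-visual cross-modal retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Branch(nn.Module):
    """One modality branch: maps input features to the shared subspace."""

    def __init__(self, in_dim: int, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances in the shared space are comparable.
        return F.normalize(self.net(x), dim=-1)


def triplet_step(audio_branch, video_branch, audio, video_pos, video_neg,
                 margin: float = 0.2) -> torch.Tensor:
    """One training step: anchor = audio clip, positive = its matching
    video, negative = a mismatching video from another semantic class."""
    a = audio_branch(audio)        # (batch, out_dim)
    vp = video_branch(video_pos)   # matching pairs
    vn = video_branch(video_neg)   # mismatching pairs
    return F.triplet_margin_loss(a, vp, vn, margin=margin)


if __name__ == "__main__":
    # Toy feature dimensions for pre-extracted audio/visual features (assumed).
    audio_branch, video_branch = Branch(in_dim=128), Branch(in_dim=1024)
    params = list(audio_branch.parameters()) + list(video_branch.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)

    audio = torch.randn(32, 128)
    video_pos = torch.randn(32, 1024)
    video_neg = torch.randn(32, 1024)

    loss = triplet_step(audio_branch, video_branch, audio, video_pos, video_neg)
    loss.backward()
    opt.step()
    print(f"triplet loss: {loss.item():.4f}")
```

At retrieval time, under the same assumptions, a query in one modality would be projected by its branch and ranked against the other modality's embeddings by cosine similarity in the shared subspace.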
