ACM Transactions on Multimedia Computing, Communications, and Applications

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval

Abstract

Cross-modal retrieval aims to retrieve data in one modality given a query in another modality, and has been an active research issue in the fields of multimedia, information retrieval, computer vision, and databases. Most existing work focuses on cross-modal retrieval between text and image, text and video, or lyrics and audio. Little research addresses cross-modal retrieval between audio and video, owing to the limited number of paired audio-video datasets and their limited semantic information. The main challenge of the audio-visual cross-modal retrieval task is learning joint embeddings in a shared subspace in which similarity can be computed across modalities, where the goal of generating new representations is to maximize the correlation between the audio and visual modality spaces. In this work, we propose TNN-C-CCA, a novel deep triplet neural network with cluster canonical correlation analysis: an end-to-end supervised learning architecture with an audio branch and a video branch. We consider not only the matching pairs in the common space but also the mismatching pairs when maximizing the correlation. In particular, two significant contributions are made. First, a better representation can be generated by constructing a deep triplet neural network with a triplet loss, whose learned projections maximize the correlation in the shared subspace. Second, both positive and negative examples are used during learning to improve the quality of the embeddings learned between audio and video. Our experiments are run with fivefold cross-validation, and average performance is reported to assess audio-video cross-modal retrieval. The experimental results on two different audio-visual datasets show that the proposed two-branch learning architecture outperforms six existing canonical correlation analysis-based methods and four state-of-the-art cross-modal retrieval methods.
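The following is a minimal sketch, not the authors' implementation, of the triplet-based embedding learning the abstract describes: an audio branch and a video branch project pre-extracted features into a shared subspace, and a triplet loss pulls matching audio-video pairs together while pushing mismatching pairs apart. The `Branch` module, the `triplet_step` helper, the layer sizes, and the feature dimensions are all illustrative assumptions, and the cluster-CCA component of TNN-C-CCA is not modeled here.

```python
# Minimal sketch (assumed, not from the paper): two-branch triplet
# embedding learning for audio-visual cross-modal retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Branch(nn.Module):
    """One modality branch: maps input features to the shared subspace."""

    def __init__(self, in_dim: int, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances in the shared space are comparable.
        return F.normalize(self.net(x), dim=-1)


def triplet_step(audio_branch, video_branch, audio, video_pos, video_neg,
                 margin: float = 0.2) -> torch.Tensor:
    """One training step: anchor = audio clip, positive = its matching
    video, negative = a mismatching video from another semantic class."""
    a = audio_branch(audio)        # (batch, out_dim)
    vp = video_branch(video_pos)   # matching pairs
    vn = video_branch(video_neg)   # mismatching pairs
    return F.triplet_margin_loss(a, vp, vn, margin=margin)


if __name__ == "__main__":
    # Toy feature dimensions for pre-extracted audio/visual features (assumed).
    audio_branch, video_branch = Branch(in_dim=128), Branch(in_dim=1024)
    params = list(audio_branch.parameters()) + list(video_branch.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)

    audio = torch.randn(32, 128)
    video_pos = torch.randn(32, 1024)
    video_neg = torch.randn(32, 1024)

    loss = triplet_step(audio_branch, video_branch, audio, video_pos, video_neg)
    loss.backward()
    opt.step()
    print(f"triplet loss: {loss.item():.4f}")
```

At retrieval time, under the same assumptions, a query in one modality would be projected by its branch and ranked against the other modality's embeddings by cosine similarity in the shared subspace.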
