ACM Transactions on Multimedia Computing, Communications, and Applications

Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval


Abstract

Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations across different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data in the different modalities are mapped to the same canonical space, where intermodal canonical correlation analysis is used as the objective function to measure the similarity of temporal structures. This is the first study to use deep architectures for learning the temporal correlation between audio and lyrics. A pretrained Doc2Vec model followed by fully connected layers is used to represent lyrics. Two significant contributions are made in the audio branch: (i) we propose an end-to-end network to learn the cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are performed simultaneously and a joint representation is learned that takes temporal structures into account; (ii) for feature extraction, we represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results, using audio to retrieve lyrics and lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architecture in cross-modal music retrieval.
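To make the two-branch design described above concrete, the following is a minimal PyTorch-style sketch, assuming hypothetical dimensions (4096-d VGG16 frame features, a 300-d Doc2Vec lyrics vector, and a 128-d joint space); the paper's exact layer sizes and choice of recurrent cell are not specified in this abstract.

import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    # Audio branch: a recurrent network summarizes a short sequence of
    # precomputed VGG16-style frame features into one joint-space vector.
    def __init__(self, feat_dim=4096, hidden_dim=512, joint_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, joint_dim)

    def forward(self, x):              # x: (batch, seq_len, feat_dim)
        _, h = self.rnn(x)             # h: (1, batch, hidden_dim)
        return self.proj(h.squeeze(0))

class LyricsBranch(nn.Module):
    # Lyrics branch: fully connected layers on top of a pretrained
    # Doc2Vec document vector.
    def __init__(self, doc_dim=300, hidden_dim=256, joint_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(doc_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, joint_dim),
        )

    def forward(self, d):              # d: (batch, doc_dim)
        return self.mlp(d)

At retrieval time, both modalities are embedded into the same joint space and ranked by a similarity measure such as cosine similarity.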
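The abstract names intermodal canonical correlation analysis as the objective. Below is a minimal sketch of a Deep-CCA-style loss, assuming PyTorch and simple diagonal regularization (eps); the paper's exact regularization and optimization details may differ.

import torch

def cca_loss(H1, H2, eps=1e-4):
    # H1, H2: (batch, dim) embeddings from the audio and lyrics branches.
    # Returns the negative sum of canonical correlations between the two
    # views, so minimizing the loss maximizes intermodal correlation.
    n = H1.size(0)
    H1 = H1 - H1.mean(dim=0, keepdim=True)
    H2 = H2 - H2.mean(dim=0, keepdim=True)
    S12 = H1.t() @ H2 / (n - 1)
    S11 = H1.t() @ H1 / (n - 1) + eps * torch.eye(H1.size(1), device=H1.device)
    S22 = H2.t() @ H2 / (n - 1) + eps * torch.eye(H2.size(1), device=H2.device)

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        w, V = torch.linalg.eigh(S)
        return V @ torch.diag(w.clamp_min(eps).rsqrt()) @ V.t()

    # Singular values of T are the canonical correlations of the batch.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return -torch.linalg.svdvals(T).sum()

In training, a mini-batch of paired (audio, lyrics) items is passed through the two branches and this loss is backpropagated through both, so that feature extraction and correlation learning proceed end to end.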
