Little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Motivated by the inherently temporal structure of music, we aim to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the two modalities are projected into a shared canonical space, where inter-modal canonical correlation analysis is used as the objective function to measure the similarity of temporal structures. This is the first study to examine the correlation between language and music audio through deep architectures that learn the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further propose an end-to-end architecture that trains the convolutional layers and fully-connected layers simultaneously to better learn the temporal structures of music audio. In particular, our end-to-end deep architecture has two properties: it performs feature learning and cross-modal correlation learning simultaneously, and it learns joint representations that account for temporal structures. Experimental results on retrieving lyrics from audio and retrieving audio from lyrics verify the effectiveness of the proposed deep correlation learning architectures for cross-modal music retrieval.
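To make the two-branch design concrete, below is a minimal, hypothetical PyTorch sketch of such a model: a CNN-plus-fully-connected audio branch, a fully-connected lyrics branch on top of precomputed Doc2vec vectors, and a simplified correlation objective standing in for the full CCA loss. All layer sizes, variable names, and the toy data are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical two-branch correlation model (illustrative only).
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """CNN over a (batch, 1, mel_bins, frames) spectrogram, then FC layers."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 4 * 4, 256), nn.ReLU(),
                                nn.Linear(256, embed_dim))

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class LyricsBranch(nn.Module):
    """Fully-connected layers on top of a precomputed Doc2vec vector."""
    def __init__(self, doc2vec_dim=300, embed_dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(doc2vec_dim, 256), nn.ReLU(),
                                nn.Linear(256, embed_dim))

    def forward(self, x):
        return self.fc(x)

def correlation_loss(a, b, eps=1e-8):
    """Negative mean per-dimension correlation between paired embeddings;
    a simplified stand-in for the full CCA objective."""
    a = a - a.mean(dim=0, keepdim=True)
    b = b - b.mean(dim=0, keepdim=True)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()

# Toy forward/backward pass on random data.
audio_net, lyrics_net = AudioBranch(), LyricsBranch()
audio = torch.randn(8, 1, 96, 128)   # batch of log-mel spectrogram patches
lyrics = torch.randn(8, 300)         # batch of Doc2vec lyric vectors
loss = correlation_loss(audio_net(audio), lyrics_net(lyrics))
loss.backward()
print(loss.item())
```

Because both branches are trained through the shared correlation objective, such a setup can either keep the CNN fixed (the pre-trained-feature variant) or update it jointly with the fully-connected layers (the end-to-end variant described above).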