EURASIP Journal on Audio, Speech, and Music Processing

Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks



Abstract

In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is trained directly on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used for training a linear support vector machine (SVM). Moreover, the predictions of the two models, namely the class probabilities predicted by the CRNN and the decision function of the SVM, are combined in a decision-level fusion to achieve the final prediction. For the pre-trained CNNs we use as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, on ESC-50, and on the multi-channel acoustic recordings from DCASE 2018, task 5. We have refrained from additional data augmentation, as our primary goal is to analyse the general performance of the proposed system on different datasets. We show that, using our system, it is possible to achieve competitive performance on all datasets, and we demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, either in the form of the CRNN trained directly on the target data or of the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks.
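The abstract describes the two-branch setup in prose; the sketch below illustrates one possible realisation of it. The choice of ResNet-50 as the backbone, the penultimate layer as the feature layer, and the equal-weight score averaging in the fusion are illustrative assumptions, not the authors' reported configuration.

```
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Branch 1: ImageNet pre-trained CNN used as a fixed feature extractor.
cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
cnn.fc = torch.nn.Identity()   # keep the penultimate-layer activations as the feature vector
cnn.eval()

def extract_features(mel_batch):
    # mel_batch: float tensor of shape (N, 3, H, W); Mel-spectrograms replicated
    # to three channels so they match the RGB input expected by the CNN.
    with torch.no_grad():
        return cnn(mel_batch).numpy()

# Branch 2 (not shown): a CRNN trained directly on the Mel-spectrograms,
# yielding per-class probabilities `crnn_probs` of shape (N, C).

svm = LinearSVC()  # fitted on extract_features(train_spectrograms) and the class labels

def fuse(crnn_probs, svm_scores):
    # Decision-level fusion: rescale the SVM decision-function values per sample
    # to [0, 1] and average them with the CRNN class probabilities (equal weights assumed).
    s = svm_scores - svm_scores.min(axis=1, keepdims=True)
    s = s / (s.max(axis=1, keepdims=True) + 1e-9)
    return np.argmax(0.5 * crnn_probs + 0.5 * s, axis=1)
```

After fitting the SVM, `svm_scores = svm.decision_function(extract_features(test_batch))` would supply the second input to `fuse`.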
