EURASIP Journal on Audio, Speech, and Music Processing

Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks



Abstract

In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is trained directly on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used for training a linear support vector machine (SVM). Moreover, the predictions of the two models, namely the class probabilities predicted by the CRNN and the decision function of the SVM, are combined in a decision-level fusion to achieve the final prediction. For the pre-trained CNNs we use as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, on ESC-50, and on the multi-channel acoustic recordings from DCASE 2018, task 5. We have refrained from additional data augmentation, as our primary goal is to analyse the general performance of the proposed system on different datasets. We show that, using our system, it is possible to achieve competitive performance on all datasets, and we demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, either in the form of the CRNN trained directly on the target data or of the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks.
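The abstract describes the two-branch setup in prose; the sketch below illustrates one possible realisation of it. The choice of ResNet-50 as the backbone, the penultimate layer as the feature layer, and the equal-weight score averaging in the fusion are illustrative assumptions, not the authors' reported configuration.

```
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Branch 1: ImageNet pre-trained CNN used as a fixed feature extractor.
cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
cnn.fc = torch.nn.Identity()   # keep the penultimate-layer activations as the feature vector
cnn.eval()

def extract_features(mel_batch):
    # mel_batch: float tensor of shape (N, 3, H, W); Mel-spectrograms replicated
    # to three channels so they match the RGB input expected by the CNN.
    with torch.no_grad():
        return cnn(mel_batch).numpy()

# Branch 2 (not shown): a CRNN trained directly on the Mel-spectrograms,
# yielding per-class probabilities `crnn_probs` of shape (N, C).

svm = LinearSVC()  # fitted on extract_features(train_spectrograms) and the class labels

def fuse(crnn_probs, svm_scores):
    # Decision-level fusion: rescale the SVM decision-function values per sample
    # to [0, 1] and average them with the CRNN class probabilities (equal weights assumed).
    s = svm_scores - svm_scores.min(axis=1, keepdims=True)
    s = s / (s.max(axis=1, keepdims=True) + 1e-9)
    return np.argmax(0.5 * crnn_probs + 0.5 * s, axis=1)
```

After fitting the SVM, `svm_scores = svm.decision_function(extract_features(test_batch))` would supply the second input to `fuse`.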
