IEEE/ACM Transactions on Audio, Speech, and Language Processing

Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data



Abstract

Sound event detection (SED) aims to detect when sound events happen in an audio clip and to recognize what they are. Many supervised SED algorithms rely on strongly labelled data that contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without knowing their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied to a T-F representation, such as the log mel spectrogram of an audio clip, to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used to separate the sound events from the background scenes in the T-F domain. Then, a classification mapping is applied to the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using global weighted rank pooling. In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound events data. When mixing at 0 dB, the proposed method achieved F1 scores of 0.534, 0.398, and 0.167 in audio tagging, frame-wise SED, and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237, and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, a task that previous methods were not able to perform.
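
The segmentation-then-classification pipeline described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' released implementation: a small convolutional network (layer sizes and class count are assumptions) produces one T-F mask per sound event class, and global weighted rank pooling with an assumed decay factor `r` aggregates each mask into a clip-level presence probability, so the model can be trained with clip-level (weak) labels only.

```python
# Minimal sketch of the weakly-labelled SED framework described in the abstract.
# The CNN architecture, number of classes, and decay factor r are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class SegmentationCNN(nn.Module):
    """Segmentation mapping: (batch, 1, time, mel) -> (batch, classes, time, mel)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        # 1x1 convolution + sigmoid gives one mask in [0, 1] per sound event class.
        self.mask_head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mask_head(self.features(x)))

def global_weighted_rank_pooling(masks: torch.Tensor, r: float = 0.98) -> torch.Tensor:
    """Classification mapping: aggregate each T-F mask into a clip-level probability.

    Mask values are sorted in descending order and averaged with weights
    r^0, r^1, ..., so strong activations dominate but the whole mask contributes.
    """
    b, c, t, f = masks.shape
    flat, _ = masks.reshape(b, c, t * f).sort(dim=-1, descending=True)
    weights = r ** torch.arange(t * f, dtype=flat.dtype, device=flat.device)
    return (flat * weights).sum(dim=-1) / weights.sum()

if __name__ == "__main__":
    # Training would apply binary cross-entropy between clip_probs and the weak
    # (clip-level) labels; at test time the masks themselves give T-F segmentation
    # and, pooled over frequency, frame-wise detections with onsets and offsets.
    model = SegmentationCNN(num_classes=10)           # 10 classes is an assumption
    clip = torch.randn(4, 1, 320, 64)                 # (batch, 1, frames, mel bins)
    masks = model(clip)                               # T-F segmentation masks
    clip_probs = global_weighted_rank_pooling(masks)  # presence probabilities
    print(masks.shape, clip_probs.shape)              # (4, 10, 320, 64), (4, 10)
```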

