IEEE Transactions on Multimedia

Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM


Abstract

Speaker diarization (SD) is the process of assigning the speech segments of an audio stream to their corresponding speakers, and thus comprises the problems of voice activity detection (VAD), speaker labeling/identification, and often sound source localization (SSL). Most past research has targeted applications such as broadcast news, meetings, conversational telephony, and automatic multimodal data annotation, where SD may be performed off-line. A more recent research focus, however, is human–computer interaction (HCI) systems in which SD must be performed on-line and in real time, as in modern gaming devices and interaction with large displays. Such applications often further suffer from noise, reverberation, and overlapping speech, making them increasingly challenging. In these situations, multimodal/multisensory approaches can provide more accurate results than unimodal ones, since one data stream can compensate for occasional instabilities in the others. Accordingly, this paper presents an on-line multimodal SD algorithm designed to work in a realistic environment with multiple, overlapping speakers. Our work employs a microphone array, a color camera, and a depth sensor as input streams, from which speech-related features are extracted and later merged through a support vector machine approach consisting of VAD and SSL modules. Speaker identification is incorporated through a hybrid technique combining face-positioning history and face recognition. Our final SD approach experimentally achieves an average diarization error rate of 11.48% in scenarios with up to three simultaneous speakers, and is able to run .
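The fusion step described above can be sketched at the decision side: per-frame features from the microphone array, color camera, and depth sensor are concatenated into one vector and classified by a pre-trained SVM. The feature names, weights, and bias below are illustrative assumptions (the paper does not publish its trained model); a linear SVM is used here for simplicity, and its decision function is simply the sign of w · x + b.

```python
# Hedged sketch of the decision side of SVM-based multimodal sensor fusion
# for VAD. All numeric values below are illustrative assumptions, not the
# paper's trained parameters.

def svm_decision(features, weights, bias):
    """Linear SVM decision function: frame is speech iff w . x + b > 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score > 0.0

# Hypothetical fused feature vector for one frame:
#   [audio energy, lip-motion magnitude, depth-based mouth openness]
frame = [0.82, 0.40, 0.15]
weights = [1.5, 2.0, 1.0]  # illustrative "trained" weights
bias = -1.0                # illustrative "trained" bias

is_speech = svm_decision(frame, weights, bias)  # score = 1.18 > 0 -> True
```

In practice the SVM would be trained off-line on labeled frames; a kernelized SVM would replace the dot product with a kernel expansion over support vectors, but the fusion idea (one joint feature vector per frame across modalities) is the same.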
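The reported metric, diarization error rate (DER), can be illustrated with a simplified frame-level computation. This is a sketch under stated assumptions: real DER scoring (e.g., the NIST Rich Transcription protocol) works on timed segments with a forgiveness collar and an optimal reference-to-hypothesis speaker mapping; here frames are assumed pre-aligned and labels already mapped.

```python
# Simplified frame-level diarization error rate:
#   DER = (missed speech + false alarm + speaker confusion) / total reference speech

def diarization_error_rate(reference, hypothesis):
    """Each sequence holds one speaker label per frame, or None for silence."""
    missed = false_alarm = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            ref_speech += 1
            if hyp is None:
                missed += 1          # reference speech, hypothesis silent
            elif hyp != ref:
                confusion += 1       # both speech, wrong speaker
        elif hyp is not None:
            false_alarm += 1         # hypothesis speech during silence
    return (missed + false_alarm + confusion) / ref_speech

ref = ["A", "A", "B", "B", None]
hyp = ["A", "B", "B", None, "A"]
der = diarization_error_rate(ref, hyp)  # (1 + 1 + 1) / 4 = 0.75
```

A DER of 11.48%, as reported in the abstract, means that errors of these three kinds together cover about 11.5% of the reference speech time.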
