Journal: Multimedia Tools and Applications

Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Abstract

Speaker diarization is traditionally defined as the problem of determining "who speaks when" given an audio or video stream. This is an important task in many meeting-room applications, including automatic transcription of conversations, camera steering, and content summarization. When the room is equipped with microphone arrays and cameras, speakers can be distinguished according to their location, and the problem can be addressed through localization techniques. This article proposes a multimodal speaker diarization system for meeting environments based on a modified SRP-PHAT function evaluated on space volumes rather than discrete points. In our system, this function is used in combination with a circular array, enabling audio-based localization through the selection of local maxima. Voicing detection is used to detect speech frames, whereas video analysis is introduced to aid the decision when users move or speak simultaneously. The approach is evaluated on the well-known AMI dataset, with approximately 100 hours of realistic meeting recordings, and shows an average diarization error rate of 21% to 25%.
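
As a rough illustration of the audio localization step described above, the following Python sketch computes a conventional, point-based SRP-PHAT power map over candidate azimuths for a circular array and picks its local maxima. It is not the volume-evaluated variant proposed in the article, whose details are not given in the abstract. The far-field model, the speed of sound, the array geometry (8 microphones, 10 cm radius), the sampling rate, and all function names are assumptions made for this example only.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def circular_array(n_mics, radius):
    """Microphone positions, shape (n_mics, 2), on a circle around the origin."""
    ang = 2 * np.pi * np.arange(n_mics) / n_mics
    return radius * np.column_stack((np.cos(ang), np.sin(ang)))


def simulate_far_field(mics, fs, n, azimuth, seed=0):
    """Hypothetical test signal: white noise arriving as a plane wave."""
    rng = np.random.default_rng(seed)
    src = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    u = np.array([np.cos(azimuth), np.sin(azimuth)])
    tau = -(mics @ u) / C  # arrival delay at each mic relative to the origin
    spec = src[None, :] * np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])
    return np.fft.irfft(spec, n=n, axis=1)


def srp_phat_azimuth(frames, mics, fs, n_dirs=360):
    """Point-based SRP-PHAT power for n_dirs candidate azimuths (far field)."""
    n_mics, n = frames.shape
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    az = 2 * np.pi * np.arange(n_dirs) / n_dirs
    dirs = np.column_stack((np.cos(az), np.sin(az)))
    tau = -(dirs @ mics.T) / C  # candidate delays, shape (n_dirs, n_mics)
    power = np.zeros(n_dirs)
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = spec[i] * np.conj(spec[j])
            cross = cross / (np.abs(cross) + 1e-12)  # PHAT weighting
            tdoa = tau[:, i] - tau[:, j]
            # Undo the phase shift each candidate direction would cause.
            steer = np.exp(2j * np.pi * np.outer(tdoa, freqs))
            power += np.real(steer @ cross)
    return az, power


def local_maxima(power, top_k=2):
    """Indices of the strongest local maxima on a circular grid."""
    left, right = np.roll(power, 1), np.roll(power, -1)
    idx = np.flatnonzero((power > left) & (power >= right))
    return idx[np.argsort(power[idx])[::-1][:top_k]]


fs, n = 16000, 1024
mics = circular_array(n_mics=8, radius=0.1)
frames = simulate_far_field(mics, fs, n, azimuth=np.deg2rad(60))
az, power = srp_phat_azimuth(frames, mics, fs)
peaks = local_maxima(power)
est_deg = int(round(np.degrees(az[peaks[0]])))
print("estimated azimuth (degrees):", est_deg)
```

In a diarization setting, the peak positions found per speech frame would then be associated with speaker locations; how the article combines them with voicing detection and video analysis is not specified in the abstract, so that part is left out here.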
