Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Cabanas-Molero P.; Lucena M.; Fuertes J. M.; Vera-Candeas P.; Ruiz-Reyes N.

Abstract

Speaker diarization is traditionally defined as the problem of determining "who speaks when" given an audio or video stream. This is an important task in many applications for meeting rooms, including automatic transcription of conversations, camera steering or content summarization. When the room is equipped with microphone arrays and cameras, speakers can be distinguished according to their location and the problem can be addressed through localization techniques. This article proposes a multimodal speaker diarization system for meeting environments based on a modified SRP-PHAT function evaluated on space volumes rather than discrete points. In our system, this function is used in combination with a circular array, enabling audio-based localization based on the selection of local maxima. Voicing detection is used to detect speech frames, whereas video analysis is introduced to aid in the decision when users move or simultaneously speak. The approach is evaluated on the well-known AMI dataset with approximately 100 hours of realistic meeting recordings and shows an average diarization error rate of 21% - 25%.
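The localization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a far-field source, a 2-D azimuth search, an 8-microphone circular array of 10 cm radius sampled at 16 kHz, and a noise-free simulated frame. The function names (`circular_array`, `delays`, `gcc_phat`, `srp_map`, `simulate`) are hypothetical. With `half_width_deg=0` the score is the conventional SRP-PHAT evaluated at discrete directions; with a positive half width, the GCC-PHAT values are summed over the whole range of lags spanned by an angular sector, which is one plausible reading of "evaluated on space volumes rather than discrete points".

```python
import itertools
import numpy as np

C = 343.0    # speed of sound in m/s (assumed)
FS = 16000   # sample rate in Hz (assumed)

def circular_array(n_mics=8, radius=0.1):
    """Microphone xy-positions on a circle (hypothetical geometry)."""
    ang = 2 * np.pi * np.arange(n_mics) / n_mics
    return radius * np.stack([np.cos(ang), np.sin(ang)], axis=1)

def delays(mics, azimuth_deg):
    """Far-field arrival delay of each microphone, in samples."""
    az = np.deg2rad(azimuth_deg)
    u = np.array([np.cos(az), np.sin(az)])  # unit vector toward the source
    return -(mics @ u) * FS / C

def gcc_phat(x1, x2):
    """GCC-PHAT; index k holds lag k, negative lags wrap around."""
    n = len(x1)
    cross = np.fft.rfft(x1) * np.conj(np.fft.rfft(x2))
    cross /= np.abs(cross) + 1e-12          # phase transform weighting
    return np.fft.irfft(cross, n)

def srp_map(frames, mics, centers_deg, half_width_deg=0.0, n_sub=5):
    """Steered response power per direction (point) or sector (volume)."""
    pairs = list(itertools.combinations(range(len(mics)), 2))
    gcc = {p: gcc_phat(frames[p[0]], frames[p[1]]) for p in pairs}
    scores = np.zeros(len(centers_deg))
    for s, center in enumerate(centers_deg):
        # Sample the sector to bound the lags it can produce.
        angles = center + np.linspace(-half_width_deg, half_width_deg, n_sub)
        tau = np.stack([delays(mics, a) for a in angles])
        for i, j in pairs:
            tdoa = tau[:, i] - tau[:, j]
            lo, hi = int(np.rint(tdoa.min())), int(np.rint(tdoa.max()))
            # Accumulate every integer lag covered by the sector.
            scores[s] += gcc[(i, j)][np.arange(lo, hi + 1)].sum()
    return scores

def simulate(mics, azimuth_deg, n=4096, seed=0):
    """White-noise source with fractional delays applied in frequency."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)                  # cycles per sample
    tau = delays(mics, azimuth_deg)
    return np.fft.irfft(spec * np.exp(-2j * np.pi * f * tau[:, None]), n)

mics = circular_array()
frames = simulate(mics, azimuth_deg=60.0)
fine = np.arange(0, 360, 5)      # discrete directions, 5 degree grid
coarse = np.arange(0, 360, 30)   # sectors of 30 degrees
point_est = fine[np.argmax(srp_map(frames, mics, fine))]
volume_est = coarse[np.argmax(srp_map(frames, mics, coarse, half_width_deg=15.0))]
print(point_est, volume_est)
```

The sector variant lets a coarse grid cover the whole search space without skipping over a peak that falls between grid points, at the cost of a less sharply localized estimate. A full system as described in the abstract would additionally select several local maxima, gate frames with voicing detection, and fuse the result with video analysis; none of that is sketched here.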

Bibliographic details

Source: Multimedia Tools and Applications, 2018, no. 20, pp. 27685-27707 (23 pages)
Authors: Cabanas-Molero P.; Lucena M.; Fuertes J. M.; Vera-Candeas P.; Ruiz-Reyes N.
Author affiliations:

Univ Jaen, Dept Telecommun Engn, Jaen, Spain;

Univ Jaen, Dept Comp Sci, Jaen, Spain;

Univ Jaen, Dept Comp Sci, Jaen, Spain;

Univ Jaen, Dept Telecommun Engn, Jaen, Spain;

Univ Jaen, Dept Telecommun Engn, Jaen, Spain;

Original format: PDF
Language: English
Keywords: Speaker diarization; Meeting rooms; SRP-PHAT; Multimodal processing

Similar literature

1. Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos [J]. Zhang C., Yin P., Rui Y. IEEE Transactions on Multimedia, 2008, no. 8.
2. Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM [J]. Peruffo Minotto Vicente, Rosito Jung Claudio, Lee Bowon. IEEE Transactions on Multimedia, 2015, no. 10.
3. A Multimodal Approach to Speaker Diarization on TV Talk-Shows [J]. Vallet F., Essid S., Carrive J. IEEE Transactions on Multimedia, 2013, no. 3.
4. Multimodal Speaker Diarization of Real-World Meetings Using D-Vectors With Spatial Features [C]. Wonjune Kang, Brandon C. Roy, Wesley Chow. IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
5. Use of Speaker Location Features in Meeting Diarization [D]. Otterson, Scott. 2008.
6. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model [O]. Rehan Ahmad, Syed Zubair, Hani Alquhayz. 2019.
7. Multi-Modal Speaker Diarization of Real-World Meetings Using Compressed-Domain Video Features [O]. Gerald Friedl, Hayley Hung, Chuohao Yeo. 2015.
8. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment [R]. Hansen, J. H. 2015.
