IEEE Transactions on Multimedia

Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM


Abstract

Speaker diarization (SD) is the process of assigning the speech segments of an audio stream to their corresponding speakers, and thus comprises the problems of voice activity detection (VAD), speaker labeling/identification, and often sound source localization (SSL). Most past research has targeted applications such as broadcast news, meetings, conversational telephony, and automatic multimodal data annotation, where SD may be performed off-line. A more recent research focus, however, is human–computer interaction (HCI) systems in which SD must be performed on-line and in real time, as in modern gaming devices and interaction with large displays. Such applications often further suffer from noise, reverberation, and overlapping speech, making them increasingly challenging. In these situations, multimodal/multisensory approaches can provide more accurate results than unimodal ones, since one data stream can compensate for occasional instabilities in the others. Accordingly, this paper presents an on-line multimodal SD algorithm designed to work in a realistic environment with multiple, overlapping speakers. Our work employs a microphone array, a color camera, and a depth sensor as input streams, from which speech-related features are extracted and later merged through a support vector machine approach consisting of VAD and SSL modules. Speaker identification is incorporated through a hybrid technique combining face-positioning history and face recognition. Our final SD approach experimentally achieves an average diarization error rate of 11.48% in scenarios with up to three simultaneous speakers, and is able to run .
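The fusion step described above can be sketched at the decision side: per-frame features from the microphone array, color camera, and depth sensor are concatenated into one vector and classified by a pre-trained SVM. The feature names, weights, and bias below are illustrative assumptions (the paper does not publish its trained model); a linear SVM is used here for simplicity, and its decision function is simply the sign of w · x + b.

```python
# Hedged sketch of the decision side of SVM-based multimodal sensor fusion
# for VAD. All numeric values below are illustrative assumptions, not the
# paper's trained parameters.

def svm_decision(features, weights, bias):
    """Linear SVM decision function: frame is speech iff w . x + b > 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score > 0.0

# Hypothetical fused feature vector for one frame:
#   [audio energy, lip-motion magnitude, depth-based mouth openness]
frame = [0.82, 0.40, 0.15]
weights = [1.5, 2.0, 1.0]  # illustrative "trained" weights
bias = -1.0                # illustrative "trained" bias

is_speech = svm_decision(frame, weights, bias)  # score = 1.18 > 0 -> True
```

In practice the SVM would be trained off-line on labeled frames; a kernelized SVM would replace the dot product with a kernel expansion over support vectors, but the fusion idea (one joint feature vector per frame across modalities) is the same.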
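The reported metric, diarization error rate (DER), can be illustrated with a simplified frame-level computation. This is a sketch under stated assumptions: real DER scoring (e.g., the NIST Rich Transcription protocol) works on timed segments with a forgiveness collar and an optimal reference-to-hypothesis speaker mapping; here frames are assumed pre-aligned and labels already mapped.

```python
# Simplified frame-level diarization error rate:
#   DER = (missed speech + false alarm + speaker confusion) / total reference speech

def diarization_error_rate(reference, hypothesis):
    """Each sequence holds one speaker label per frame, or None for silence."""
    missed = false_alarm = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            ref_speech += 1
            if hyp is None:
                missed += 1          # reference speech, hypothesis silent
            elif hyp != ref:
                confusion += 1       # both speech, wrong speaker
        elif hyp is not None:
            false_alarm += 1         # hypothesis speech during silence
    return (missed + false_alarm + confusion) / ref_speech

ref = ["A", "A", "B", "B", None]
hyp = ["A", "B", "B", None, "A"]
der = diarization_error_rate(ref, hyp)  # (1 + 1 + 1) / 4 = 0.75
```

A DER of 11.48%, as reported in the abstract, means that errors of these three kinds together cover about 11.5% of the reference speech time.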
