2011 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)

Fast speaker diarization using a high-level scripting language


Abstract

Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50–250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy.
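The computational bottleneck the abstract identifies is EM training of Gaussian Mixture Models over the audio frames. As an illustration of that core computation (not the authors' GPU-specialized system), here is a minimal diagonal-covariance GMM trained with EM in plain NumPy; the function name and the quantile-based initialization are choices made for this sketch:

```python
import numpy as np

def train_gmm(X, n_components=2, n_iter=50):
    """Illustrative EM training of a diagonal-covariance GMM.

    X: (n_frames, n_dims) array of feature vectors (e.g. MFCCs).
    Returns (weights, means, variances).
    """
    n, d = X.shape
    # Deterministic init: spread component means along the
    # highest-variance feature dimension.
    dim = X.var(axis=0).argmax()
    order = np.argsort(X[:, dim])
    idx = order[np.linspace(0, n - 1, n_components).astype(int)]
    means = X[idx].copy()
    variances = np.tile(X.var(axis=0), (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # E-step: per-frame log-likelihood under each component.
        diff2 = (X[:, None, :] - means[None]) ** 2        # (n, k, d)
        log_p = (-0.5 * (np.log(2 * np.pi * variances).sum(axis=1)[None, :]
                         + (diff2 / variances[None]).sum(axis=2))
                 + np.log(weights)[None, :])               # (n, k)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)                    # responsibilities

        # M-step: re-estimate weights, means, variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```

In a diarization pipeline this training step runs repeatedly inside the agglomerative clustering loop, which is why the paper's GPU specialization of exactly this computation yields the reported 50–250× speedup.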

