A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems

Sadasivam UMA MAHESWARI; Abdul SHAHINA; Ramesh RISHICKESH; Ahmed NAYEEMULLA KHAN

Abstract

Research on robust multimodal automatic speech recognition (ASR) systems that use acoustic and visual cues, extracted with relatively noise-robust alternate speech sensors, has recently been gaining interest in the speech processing research community. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of confusable syllabic consonant-vowel units of Hindi, as a step towards building multimodal ASR systems that are robust in adverse environments for Indian languages, which are syllabic in nature. The dataset comprises 145 confusable consonant-vowel (CV) syllabic units of Hindi, recorded simultaneously using three modalities that capture acoustic and visual speech cues: a normal acoustic microphone (NM), a throat microphone (TM), and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker's headphones during recording. Convolutional Neural Network (CNN) models are built to categorise the CV units by place of articulation (POA), manner of articulation (MOA), and vowel, under clean and Lombard conditions. For validation, corresponding Hidden Markov Models (HMM) are also built and tested. Unimodal ASR systems built using each of the three speech cues from Lombard speech show a loss in recognition of MOA and vowels, while POA recognition improves in all the systems due to the Lombard effect. Combining the three complementary speech cues to build bimodal and trimodal ASR systems reduces the recognition loss due to the Lombard effect for MOA and vowels compared to the unimodal systems, while POA recognition still benefits from the Lombard effect. A bimodal system is proposed that uses only the alternate acoustic and visual cues and discriminates the place and manner of articulation better than even the standard ASR system.
Among the multimodal ASR systems studied, the proposed trimodal system based on Lombard speech gives the best recognition accuracy: 98%, 95%, and 76% for vowels, MOA, and POA, respectively, an average improvement of 36% over the unimodal ASR systems and 9% over the bimodal ASR systems.
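
The abstract does not say how the scores from the three modalities are combined; the keywords list late fusion and intermediate fusion. As a rough illustration of late fusion only, the sketch below takes a weighted average of the class posteriors produced by separate NM, TM, and lip-movement classifiers and picks the highest-scoring class. The function name `late_fusion`, the equal default weights, the three-class setup, and all score values are assumptions made up for this example, not details from the paper.

```python
def late_fusion(posteriors, weights=None):
    """Fuse per-modality class posteriors by a weighted average.

    posteriors: list of equal-length lists, one per modality.
    weights: optional per-modality weights (hypothetical; equal if omitted).
    Returns the fused scores and the index of the winning class.
    """
    n_classes = len(posteriors[0])
    if weights is None:
        weights = [1.0] * len(posteriors)
    total = sum(weights)
    # Weighted mean of the modality scores for each class
    fused = [
        sum(w * p[c] for w, p in zip(weights, posteriors)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: fused[c])
    return fused, label


# Made-up posteriors over three hypothetical classes
nm_scores = [0.5, 0.3, 0.2]   # normal microphone model
tm_scores = [0.2, 0.6, 0.2]   # throat microphone model
lip_scores = [0.1, 0.7, 0.2]  # visual (lip movement) model

fused, label = late_fusion([nm_scores, tm_scores, lip_scores])
print(label)
```

With these made-up scores, the NM model alone would pick class 0, but the fused decision follows the two models that agree on class 1. Intermediate fusion, by contrast, would combine learned features inside the network before classification and is not shown here.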


Bibliographic details

Source
Archives of Acoustics | 2020, No. 3 | pp. 419-431 | 13 pages
Authors
Sadasivam UMA MAHESWARI; Abdul SHAHINA; Ramesh RISHICKESH; Ahmed NAYEEMULLA KHAN;
Author affiliations

Department of Information Technology, SSN College of Engineering, Chennai, India;

Department of Information Technology, SSN College of Engineering, Chennai, India;

Department of Information Technology, SSN College of Engineering, Chennai, India;

School of Computing Science and Engineering, VIT University, Chennai, India;

Original format: PDF
Language: English
Keywords
Lombard speech; multimodal ASR; throat microphone; visual speech; Convolutional Neural Network; Hidden Markov Model; late fusion; intermediate fusion;

