Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors

Abstract

The focus of the thesis is on deep learning methods and their applications to multimodal data, with the potential to explore the associations between modalities and to replace missing or corrupt ones if necessary. We have chosen two important real-world applications that need to deal with multimodal data: 1) speaker recognition and identification; 2) facial expression recognition and emotion detection.

The first part of our work assesses the effectiveness of speech-related sensory data modalities and their combinations in speaker recognition using deep learning models. First, the role of electromyography (EMG) is highlighted as a unique biometric sensor for improving audio-visual speaker recognition, or as a substitute in noisy or poorly lit environments. Second, the effectiveness of deep learning is empirically confirmed through its higher robustness to all types of features in comparison to a number of commonly used baseline classifiers. Not only do deep models outperform the baseline methods; their power also increases when they integrate multiple modalities, as different modalities contain information on different aspects of the data, especially between EMG and audio. Interestingly, our deep learning approach is word-independent. Moreover, the EMG, audio, and visual parts of the samples from each speaker do not need to match, which increases the flexibility of our method in using multimodal data, particularly if one or more modalities are missing. With a dataset of 23 individuals speaking 22 words five times each, we show that EMG can replace the audio/visual modalities and, when combined with them, significantly improve the accuracy of speaker recognition.

The second part describes a study on automated emotion recognition using four different modalities: audio, video, electromyography (EMG), and electroencephalography (EEG). We collected a dataset by recording the four modalities as 12 human subjects expressed six different emotions or maintained a neutral expression. Three different aspects of emotion recognition were investigated: model selection, feature selection, and data selection. Both generative models (DBNs) and discriminative models (LSTMs) were applied to the four modalities. From these analyses we conclude that LSTMs are better for audio and video, together with their corresponding sophisticated feature extractors (MFCC and CNN), whereas DBNs are better for both EMG and EEG. By examining these signals at different stages (pre-speech, during-speech, and post-speech) of the current and following trials, we found that the most effective stages for emotion recognition from EEG occur after the emotion has been expressed, suggesting that the neural signals conveying an emotion are long-lasting.
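This record does not include code, but the claim above, that EMG can stand in for missing audio or visual input, suggests a simple decision-level fusion scheme. The Python sketch below is purely illustrative: the ModalityFusion class, the per-modality linear heads, the feature dimensions, and the logit-averaging rule are assumptions for exposition, not the architecture used in the thesis.

```python
from typing import Dict, Optional
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Averages per-modality logits; absent modalities simply drop out of the vote."""
    def __init__(self, dims: Dict[str, int], n_classes: int):
        super().__init__()
        # One linear classifier per modality (stand-ins for real per-modality encoders).
        self.heads = nn.ModuleDict(
            {name: nn.Linear(dim, n_classes) for name, dim in dims.items()}
        )

    def forward(self, inputs: Dict[str, Optional[torch.Tensor]]) -> torch.Tensor:
        # Only modalities that are actually present contribute to the decision,
        # so a missing or corrupt stream is ignored without retraining.
        logits = [self.heads[m](x) for m, x in inputs.items() if x is not None]
        if not logits:
            raise ValueError("at least one modality must be present")
        return torch.stack(logits).mean(dim=0)

# 23 speaker classes, matching the dataset described above; feature sizes are made up.
fuser = ModalityFusion({"audio": 40, "video": 128, "emg": 8}, n_classes=23)
batch = {"audio": torch.randn(4, 40), "video": None, "emg": torch.randn(4, 8)}
print(fuser(batch).shape)  # torch.Size([4, 23]); the missing video stream is skipped
```

Averaging logits over whichever modalities are available is one common way to tolerate a dropped stream at inference time; the thesis may well use a different fusion rule.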
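The second part pairs LSTMs with MFCC features for audio. As a minimal sketch of that pairing, assuming librosa for MFCC extraction and PyTorch for the model, the example below classifies one synthetic one-second utterance into seven classes (six emotions plus neutral, matching the dataset described above); the 13 coefficients and 64 hidden units are arbitrary assumptions, not the thesis's settings.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def mfcc_sequence(wave: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> torch.Tensor:
    """Return an (n_frames, n_mfcc) MFCC sequence for a mono waveform."""
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return torch.from_numpy(mfcc.T).float()

class EmotionLSTM(nn.Module):
    """LSTM over frame-level features; the final hidden state feeds a linear head."""
    def __init__(self, n_features: int = 13, hidden: int = 64, n_classes: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])   # logits: (batch, n_classes)

# One synthetic one-second utterance; 7 classes = six emotions plus neutral.
wave = np.random.randn(16000).astype(np.float32)
feats = mfcc_sequence(wave).unsqueeze(0)  # (1, n_frames, 13)
print(EmotionLSTM()(feats).shape)         # torch.Size([1, 7])
```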

Bibliographic record

  • Author: Abtahi, Farnaz
  • Institution: City University of New York
  • Degree grantor: City University of New York
  • Subjects: Computer science; Artificial intelligence
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 111 p.
  • Format: PDF
  • Language: English (eng)
  • Date added: 2022-08-17 11:37:07
