Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors

Abstract

The focus of the thesis is on deep learning methods and their applications to multimodal data, with the potential to explore the associations between modalities and to replace missing or corrupt ones if necessary. We have chosen two important real-world applications that need to deal with multimodal data: 1) speaker recognition and identification; 2) facial expression recognition and emotion detection.

The first part of our work assesses the effectiveness of speech-related sensory data modalities and their combinations in speaker recognition using deep learning models. First, the role of electromyography (EMG) is highlighted as a unique biometric sensor for improving audio-visual speaker recognition, or as a substitute in noisy or poorly lit environments. Second, the effectiveness of deep learning is empirically confirmed through its higher robustness to all types of features in comparison to a number of commonly used baseline classifiers. Not only do deep models outperform the baseline methods; their power also increases when they integrate multiple modalities, as different modalities contain information on different aspects of the data, especially between EMG and audio. Interestingly, our deep learning approach is word-independent. Moreover, the EMG, audio, and visual parts of the samples from each speaker do not need to match, which increases the flexibility of our method in using multimodal data, particularly if one or more modalities are missing. With a dataset of 23 individuals speaking 22 words five times each, we show that EMG can replace the audio/visual modalities and, when combined with them, significantly improve the accuracy of speaker recognition.

The second part describes a study on automated emotion recognition using four different modalities: audio, video, electromyography (EMG), and electroencephalography (EEG). We collected a dataset by recording the four modalities as 12 human subjects expressed six different emotions or maintained a neutral expression. Three different aspects of emotion recognition were investigated: model selection, feature selection, and data selection. Both generative models (DBNs) and discriminative models (LSTMs) were applied to the four modalities. From these analyses we conclude that LSTMs are better for audio and video, together with their corresponding sophisticated feature extractors (MFCC and CNN), whereas DBNs are better for both EMG and EEG. By examining these signals at different stages (pre-speech, during-speech, and post-speech) of the current and following trials, we found that the most effective stages for emotion recognition from EEG occur after the emotion has been expressed, suggesting that the neural signals conveying an emotion are long-lasting.
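This record does not include code, but the claim above, that EMG can stand in for missing audio or visual input, suggests a simple decision-level fusion scheme. The Python sketch below is purely illustrative: the ModalityFusion class, the per-modality linear heads, the feature dimensions, and the logit-averaging rule are assumptions for exposition, not the architecture used in the thesis.

```python
from typing import Dict, Optional
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Averages per-modality logits; absent modalities simply drop out of the vote."""
    def __init__(self, dims: Dict[str, int], n_classes: int):
        super().__init__()
        # One linear classifier per modality (stand-ins for real per-modality encoders).
        self.heads = nn.ModuleDict(
            {name: nn.Linear(dim, n_classes) for name, dim in dims.items()}
        )

    def forward(self, inputs: Dict[str, Optional[torch.Tensor]]) -> torch.Tensor:
        # Only modalities that are actually present contribute to the decision,
        # so a missing or corrupt stream is ignored without retraining.
        logits = [self.heads[m](x) for m, x in inputs.items() if x is not None]
        if not logits:
            raise ValueError("at least one modality must be present")
        return torch.stack(logits).mean(dim=0)

# 23 speaker classes, matching the dataset described above; feature sizes are made up.
fuser = ModalityFusion({"audio": 40, "video": 128, "emg": 8}, n_classes=23)
batch = {"audio": torch.randn(4, 40), "video": None, "emg": torch.randn(4, 8)}
print(fuser(batch).shape)  # torch.Size([4, 23]); the missing video stream is skipped
```

Averaging logits over whichever modalities are available is one common way to tolerate a dropped stream at inference time; the thesis may well use a different fusion rule.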
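The second part pairs LSTMs with MFCC features for audio. As a minimal sketch of that pairing, assuming librosa for MFCC extraction and PyTorch for the model, the example below classifies one synthetic one-second utterance into seven classes (six emotions plus neutral, matching the dataset described above); the 13 coefficients and 64 hidden units are arbitrary assumptions, not the thesis's settings.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def mfcc_sequence(wave: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> torch.Tensor:
    """Return an (n_frames, n_mfcc) MFCC sequence for a mono waveform."""
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return torch.from_numpy(mfcc.T).float()

class EmotionLSTM(nn.Module):
    """LSTM over frame-level features; the final hidden state feeds a linear head."""
    def __init__(self, n_features: int = 13, hidden: int = 64, n_classes: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])   # logits: (batch, n_classes)

# One synthetic one-second utterance; 7 classes = six emotions plus neutral.
wave = np.random.randn(16000).astype(np.float32)
feats = mfcc_sequence(wave).unsqueeze(0)  # (1, n_frames, 13)
print(EmotionLSTM()(feats).shape)         # torch.Size([1, 7])
```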

Bibliographic record

  • Author: Abtahi, Farnaz
  • Institution: City University of New York
  • Degree grantor: City University of New York
  • Subjects: Computer science; Artificial intelligence
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 111 p.
  • Format: PDF
  • Language: English (eng)
  • Date added: 2022-08-17 11:37:07
