Speech Emotion Recognition using Convolutional Neural Network with Audio Word-based Embedding


Abstract

A complete emotional expression typically contains a complex temporal course in a natural conversation. Related research on utterance-level, segment-level and multi-level processing lacks understanding of the underlying relation of emotional speech. In this work, a convolutional neural network (CNN) with audio word-based embedding is proposed for emotion modeling. Vector quantization is first applied to convert the low-level features of each speech frame into audio words using the k-means algorithm. Word2vec is then adopted to convert an input speech utterance into the corresponding audio word vector sequence. Finally, the audio word vector sequences of the emotion-annotated training speech data are used to construct the CNN-based emotion model. The NCKU-ES database, containing seven emotion categories (happiness, boredom, anger, anxiety, sadness, surprise and disgust), was collected, and five-fold cross validation was used to evaluate the performance of the proposed CNN-based method for speech emotion recognition. Experimental results show that the proposed method achieved an emotion recognition accuracy of 82.34%, an improvement of 8.7% over the Long Short-Term Memory (LSTM)-based method, which faced the challenging issue of long input sequences. Compared with raw features, the audio word-based embedding achieved an improvement of 3.4% for speech emotion recognition.
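The abstract outlines a three-stage pipeline: k-means vector quantization of frame-level features into audio words, Word2vec embedding of the resulting audio word sequences, and a CNN classifier over the embedded sequences. The sketch below illustrates that pipeline under stated assumptions; the codebook size, embedding dimension, sequence length, network layout, and all function names (build_codebook, utterance_to_audio_words, etc.) are illustrative choices, not the configuration reported in the paper.

```python
# Minimal sketch of the audio word-based embedding pipeline described above.
# Feature extraction, codebook size, embedding dimension and the CNN layout
# are illustrative assumptions, not the authors' exact configuration.
import numpy as np
from sklearn.cluster import KMeans
from gensim.models import Word2Vec
import torch
import torch.nn as nn

# --- Step 1: vector quantization of frame-level features into audio words ---
def build_codebook(frame_features, n_words=256):
    """frame_features: (num_frames, feat_dim) low-level descriptors pooled
    over the training set. Returns a fitted k-means codebook."""
    return KMeans(n_clusters=n_words, random_state=0).fit(frame_features)

def utterance_to_audio_words(codebook, utterance_frames):
    """Map each frame of one utterance to the id of its nearest centroid."""
    return [f"w{idx}" for idx in codebook.predict(utterance_frames)]

# --- Step 2: Word2vec embedding of the audio word sequences ---
def train_audio_word2vec(word_sequences, dim=64):
    """word_sequences: list of audio-word lists, one per training utterance."""
    return Word2Vec(sentences=word_sequences, vector_size=dim,
                    window=5, min_count=1, sg=1)

def embed_sequence(w2v, words, max_len=300):
    """Stack word vectors into a fixed-length (max_len, dim) matrix,
    zero-padding or truncating as needed."""
    mat = np.zeros((max_len, w2v.vector_size), dtype=np.float32)
    for i, w in enumerate(words[:max_len]):
        mat[i] = w2v.wv[w]
    return mat

# --- Step 3: CNN emotion classifier over the embedded sequence ---
class EmotionCNN(nn.Module):
    def __init__(self, dim=64, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1))         # global max pooling over time
        self.fc = nn.Linear(128, n_classes)  # seven emotion categories

    def forward(self, x):                    # x: (batch, max_len, dim)
        h = self.conv(x.transpose(1, 2)).squeeze(-1)
        return self.fc(h)
```

A standard training loop (cross-entropy loss over the seven classes, evaluated with five-fold cross validation as in the paper) would sit on top of this sketch and is omitted here.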
