International Conference on Automatic Face and Gesture Recognition

Audio-Visual Emotion Forecasting: Characterizing and Predicting Future Emotion Using Deep Learning


Abstract

Emotion forecasting is the task of predicting the future emotion of a speaker, i.e., the emotion label of a future speaking turn, based on the speaker's past and current audio-visual cues. Emotion forecasting systems require new problem formulations that differ from those of traditional emotion recognition systems. In this paper, we first explore two types of forecasting windows (i.e., analysis windows for which the speaker's emotion is being forecasted): utterance forecasting and time forecasting. Utterance forecasting is based on speaking turns and forecasts what the speaker's emotion will be after one, two, or three speaking turns. Time forecasting forecasts what the speaker's emotion will be after a certain range of time, such as 3-8, 8-13, and 13-18 seconds. We then investigate the benefit of using past audio-visual cues in addition to the current utterance. We design emotion forecasting models using deep learning. We compare the performance of a fully connected deep neural network (FC-DNN), a deep long short-term memory (D-LSTM) network, and a deep bidirectional long short-term memory (D-BLSTM) recurrent neural network (RNN). This allows us to examine the benefit of modeling dynamic patterns in emotion forecasting tasks. Our experimental results on the IEMOCAP benchmark dataset demonstrate that D-BLSTM and D-LSTM outperform FC-DNN by up to 2.42% in unweighted recall. When using both the current and past utterances, deep dynamic models show an improvement of up to 2.39% compared to their performance when using only the current utterance. We further analyze the benefit of using current and past utterance information compared to using the current and randomly chosen utterance information, and we find the performance improvement rises to 7.53%. The novelty of this study comes from its formulation of emotion forecasting problems and its analysis of how current and past audio-visual cues reveal future emotional information.
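To make the utterance-forecasting formulation concrete, the sketch below pairs each speaking turn's features with the emotion label one or more turns ahead. This is a minimal illustration assuming each dialog is stored as a chronological list of (features, label) turns; the data layout and function name are hypothetical, not the paper's actual IEMOCAP preprocessing.

```python
# A minimal sketch of the utterance-forecasting setup, assuming each dialog
# is stored as a chronological list of (features, emotion_label) speaking
# turns. The data layout is hypothetical, not the paper's IEMOCAP pipeline.
from typing import List, Tuple

import numpy as np


def make_utterance_forecasting_pairs(
    dialog: List[Tuple[np.ndarray, int]],
    horizon: int = 1,  # forecast 1, 2, or 3 speaking turns ahead
) -> List[Tuple[np.ndarray, int]]:
    """Pair each turn's features with the label `horizon` turns ahead."""
    pairs = []
    for t in range(len(dialog) - horizon):
        features, _ = dialog[t]                # current utterance features
        _, future_label = dialog[t + horizon]  # emotion of the future turn
        pairs.append((features, future_label))
    return pairs
```

Time forecasting would instead select the label of the utterance that falls within a given offset range (e.g., 3-8 seconds) from the current turn, rather than counting speaking turns.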
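Among the compared models, the D-BLSTM is the one that exploits temporal context in both directions. A minimal PyTorch sketch of such a classifier is shown below; the hidden size, depth, mean-over-time pooling, and four-class output are illustrative assumptions, not the configuration reported in the paper.

```python
# A minimal PyTorch sketch of a deep bidirectional LSTM (D-BLSTM) emotion
# forecaster over a sequence of frame-level audio-visual features.
# Hidden size, depth, pooling, and the 4-class output are assumptions.
import torch
import torch.nn as nn


class DBLSTMForecaster(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 128,
                 num_layers: int = 2, num_classes: int = 4):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        # Bidirectional outputs concatenate forward and backward states.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) feature sequence for one utterance
        outputs, _ = self.blstm(x)
        pooled = outputs.mean(dim=1)    # average over time steps
        return self.classifier(pooled)  # logits over future emotion classes
```

Unweighted recall, the metric quoted in the abstract, averages per-class recall without weighting by class frequency (i.e., macro-averaged recall), which avoids inflating scores on class-imbalanced corpora such as IEMOCAP.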
