International Conference on Automatic Face and Gesture Recognition

Audio-Visual Emotion Forecasting: Characterizing and Predicting Future Emotion Using Deep Learning


Abstract

Emotion forecasting is the task of predicting the future emotion of a speaker, i.e., the emotion label of a future speaking turn, based on the speaker's past and current audio-visual cues. Emotion forecasting systems require new problem formulations that differ from those of traditional emotion recognition systems. In this paper, we first explore two types of forecasting windows (i.e., analysis windows for which the speaker's emotion is being forecasted): utterance forecasting and time forecasting. Utterance forecasting is based on speaking turns and predicts what the speaker's emotion will be after one, two, or three speaking turns. Time forecasting predicts what the speaker's emotion will be after a given interval of time, such as 3-8, 8-13, or 13-18 seconds. We then investigate the benefit of using past audio-visual cues in addition to the current utterance. We design emotion forecasting models using deep learning. We compare the performance of a fully-connected deep neural network (FC-DNN), a deep long short-term memory (D-LSTM) recurrent neural network (RNN), and a deep bidirectional long short-term memory (D-BLSTM) RNN. This allows us to examine the benefit of modeling dynamic patterns in emotion forecasting tasks. Our experimental results on the IEMOCAP benchmark dataset demonstrate that D-BLSTM and D-LSTM outperform FC-DNN by up to 2.42% in unweighted recall. When using both the current and past utterances, the deep dynamic models show an improvement of up to 2.39% over their performance when using only the current utterance. We further compare the benefit of using the current and past utterances against using the current utterance and a randomly chosen utterance, and we find that the performance improvement rises to 7.53%. The novelty of this study lies in its formulation of the emotion forecasting problem and in its analysis of how current and past audio-visual cues reveal future emotional information.
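
To make the two window formulations concrete, the sketch below pairs a current speaking turn with its forecasting target under each scheme. It is a minimal illustration in plain Python: the Turn record, its field names, and the two helper functions are assumptions made for this sketch, not structures taken from the paper.

```python
# Hypothetical sketch of the two forecasting-window formulations described
# in the abstract; the Turn structure and field names are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    speaker: str   # speaker id, e.g. "A" or "B"
    start: float   # turn start time in seconds
    end: float     # turn end time in seconds
    emotion: str   # categorical emotion label of this turn

def utterance_target(turns: List[Turn], i: int, k: int) -> Optional[str]:
    """Utterance forecasting: emotion label of the same speaker's
    k-th future speaking turn (k = 1, 2, or 3)."""
    future = [t for t in turns[i + 1:] if t.speaker == turns[i].speaker]
    return future[k - 1].emotion if len(future) >= k else None

def time_target(turns: List[Turn], i: int,
                lo: float, hi: float) -> Optional[str]:
    """Time forecasting: emotion label of the same speaker's first turn
    that starts lo..hi seconds after the current turn ends (e.g. 3-8 s)."""
    anchor = turns[i].end
    for t in turns[i + 1:]:
        if t.speaker == turns[i].speaker and lo <= t.start - anchor < hi:
            return t.emotion
    return None
```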
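For the model comparison, a deep bidirectional LSTM of the kind the paper evaluates could be sketched in PyTorch as follows. The feature dimension, hidden size, number of layers, and four-class output are illustrative assumptions, not the authors' reported configuration.

```python
# A minimal PyTorch sketch of a deep bidirectional LSTM (D-BLSTM)
# emotion forecaster; all sizes below are assumed for illustration.
import torch
import torch.nn as nn

class DBLSTMForecaster(nn.Module):
    def __init__(self, feat_dim: int = 100, hidden: int = 128,
                 layers: int = 2, n_emotions: int = 4):
        super().__init__()
        # Stacked bidirectional LSTM over the (past + current) feature sequence.
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) audio-visual feature sequence.
        out, _ = self.blstm(x)
        # Predict the future emotion from the final time step's hidden state.
        return self.classifier(out[:, -1, :])

# Usage on a dummy batch: 8 sequences of 200 feature frames each.
model = DBLSTMForecaster()
logits = model(torch.randn(8, 200, 100))   # -> (8, 4) class scores
```

In such a setup, the input sequence would concatenate the past and current utterance features; per the abstract, feeding past context in addition to the current utterance is what yields up to a 2.39% gain for the deep dynamic models.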
