
Photo-real talking head with deep bidirectional LSTM


Abstract

Long short-term memory (LSTM) is a specific recurrent neural network (RNN) architecture that is designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose to use deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject's talking is first recorded as our training data. The audio/visual stereo data are converted into two parallel temporal sequences: contextual label sequences obtained by force-aligning the audio against the text, and visual feature sequences obtained by applying an active appearance model (AAM) to the lower face region of all training image samples. The deep BLSTM is then trained to learn the regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. After testing different network topologies, we found, interestingly, that the best network on our datasets is two BLSTM layers sitting on top of one feed-forward layer. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based system performs better in both objective measurements and subjective A/B tests.
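The abstract describes a regression network consisting of one feed-forward layer with two BLSTM layers stacked on top, trained to map contextual label sequences to AAM visual feature sequences by minimizing the SSE. The sketch below, assuming PyTorch and illustrative dimensions (label, hidden, and AAM feature sizes are not given in the abstract), shows one way such a model and a training step could look; it is not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch and made-up dimensions, of the architecture
# described in the abstract: one feed-forward layer followed by two stacked
# bidirectional LSTM layers, regressing AAM visual features from label sequences.
import torch
import torch.nn as nn

class BLSTMTalkingHead(nn.Module):
    def __init__(self, label_dim=600, hidden_dim=256, aam_dim=60):
        super().__init__()
        # One feed-forward layer at the bottom (reported as the best topology)
        self.ff = nn.Sequential(nn.Linear(label_dim, hidden_dim), nn.Tanh())
        # Two bidirectional LSTM layers on top
        self.blstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Linear output predicting an AAM visual feature vector per frame
        self.out = nn.Linear(2 * hidden_dim, aam_dim)

    def forward(self, labels):            # labels: (batch, frames, label_dim)
        h = self.ff(labels)
        h, _ = self.blstm(h)
        return self.out(h)                # (batch, frames, aam_dim)

model = BLSTMTalkingHead()
criterion = nn.MSELoss(reduction="sum")   # sum of squared errors (SSE)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One hypothetical training step on random stand-in data
labels = torch.randn(4, 100, 600)         # contextual label sequences
visual = torch.randn(4, 100, 60)          # AAM visual feature sequences
loss = criterion(model(labels), visual)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At synthesis time, predicted AAM parameter trajectories would be decoded back into lower-face images and composited onto the talking-head video; that rendering stage is outside the scope of this sketch.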
