Journal: Pattern Recognition: The Journal of the Pattern Recognition Society

Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network

Abstract

Synthesizing talking faces from text and audio is an increasingly active direction in human-machine and face-to-face interaction. Although progress has been made, several existing methods either model co-articulation unsatisfactorily or ignore the relations between adjacent inputs. Moreover, some of these methods train models on videos with shaky head motion or use linear face parameterization strategies, both of which further degrade synthesis quality. To address these issues, this study proposes a sequence-to-sequence convolutional neural network that automatically synthesizes talking face video with accurate lip sync. First, an advanced landmark localization pipeline is used to locate the facial landmarks accurately, which effectively reduces landmark jitter. Then, a part-based autoencoder is presented that encodes face images into a low-dimensional space to obtain compact representations. A sequence-to-sequence network, trained with multiple loss functions, is also presented to encode the relations between neighboring frames, and talking faces are synthesized through a reconstruction strategy with a decoder. Experiments on two public audio-visual datasets and a new dataset called CCTV news demonstrate the effectiveness of the proposed method against other state-of-the-art methods. (C) 2020 Elsevier Ltd. All rights reserved.
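
The abstract does not give layer details, so the following is only a minimal sketch of the general idea of encoding relations between neighboring frames with a convolution along the time axis. It is not the authors' network. The function name `temporal_conv`, the feature values, and the kernel weights are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]  # one feature vector per video/audio frame


def temporal_conv(frames: List[Frame], kernel: List[float]) -> List[Frame]:
    """Mix each frame with its neighbours along the time axis.

    `kernel` must have odd length. The sequence is treated as zero-padded at
    both ends, so the output has as many frames as the input.
    """
    if len(kernel) % 2 != 1:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    dim = len(frames[0]) if frames else 0
    out: List[Frame] = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - half  # index of the neighbouring frame
            if 0 <= src < len(frames):  # positions outside the sequence count as zeros
                for d in range(dim):
                    acc[d] += weight * frames[src][d]
        out.append(acc)
    return out


# Hypothetical input: 4 frames of 2-dimensional features (illustrative values only).
input_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [0.25, 0.5, 0.25]  # assumed fixed weights; a real network would learn them
latent_sequence = temporal_conv(input_features, kernel)
```

In a pipeline like the one the abstract describes, each output frame would then be passed to a decoder to reconstruct a face image. Because every output frame depends on its neighbors, this kind of layer is one way to avoid treating adjacent inputs independently.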
