IEEE Transactions on Circuits and Systems for Video Technology

Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio-video synchronization

Abstract

A new technology is proposed for audio-video synchronization in multimedia applications where talking human faces, either natural or synthetic, are employed for interpersonal communication services, home gaming, advanced multimodal interfaces, interactive entertainment, or movie production. Facial sequences, in fact, represent an acoustic-visual source characterized by two strongly correlated components: a talking face and the associated speech, whose synchronous presentation must be guaranteed in any multimedia application. Therefore, the exact timing for displaying a video frame or for generating a synthetic facial image has to be supervised by some form of speech analysis, performed either as preprocessing before encoding or as postprocessing before presentation. Experimental results are reported on the use of time-delay neural networks (TDNN) for the direct estimation of the visible articulation of the mouth, starting from a coherent analysis of acoustic speech. The architectural solution of employing a bank of independent single-output TDNNs has been compared with the alternative of using a single multi-output TDNN. Similarly, two different learning procedures have been applied and compared for training the TDNNs: the first based on the classic mean square error (MSE), the second on a measure of cross-correlation (CC). The superiority of the system based on multiple single-output TDNNs has been demonstrated, as well as the improvements, in terms of both convergence speed and estimation fidelity, achievable through the learning algorithm based on cross-correlation.
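The sketch below is not the authors' implementation; it is a minimal illustration of the two design choices the abstract compares: a bank of independent single-output TDNNs versus one multi-output TDNN, and training with MSE versus a cross-correlation-based criterion. Feature dimensions, layer sizes, context width, and the specific articulatory parameters (mouth width, height, protrusion) are assumptions introduced only for illustration.

```python
# Hedged sketch of the TDNN options described in the abstract (assumed details).
import torch
import torch.nn as nn


class TDNN(nn.Module):
    """Time-delay neural network: 1-D convolutions over the time axis, so each
    output frame depends on a short context window of acoustic frames."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 32, context: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=context, padding=context // 2),
            nn.Tanh(),
            nn.Conv1d(hidden, out_dim, kernel_size=context, padding=context // 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) acoustic features -> (batch, time, out_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)


def correlation_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """A cross-correlation-style criterion: maximise the Pearson correlation
    between estimated and reference articulatory trajectories over time."""
    pred = pred - pred.mean(dim=1, keepdim=True)
    target = target - target.mean(dim=1, keepdim=True)
    num = (pred * target).sum(dim=1)
    den = pred.norm(dim=1) * target.norm(dim=1) + 1e-8
    return 1.0 - (num / den).mean()


if __name__ == "__main__":
    B, T = 8, 100          # utterances per batch, frames per utterance (assumed)
    n_acoustic = 12        # e.g. cepstral coefficients per frame (assumed)
    n_articulatory = 3     # e.g. mouth width, height, lip protrusion (assumed)

    speech = torch.randn(B, T, n_acoustic)        # placeholder acoustic analysis
    lips = torch.randn(B, T, n_articulatory)      # placeholder lip trajectories

    # Option 1: a single multi-output TDNN estimating all parameters jointly.
    multi = TDNN(n_acoustic, n_articulatory)

    # Option 2: a bank of independent single-output TDNNs, one per parameter
    # (the configuration the abstract reports as superior).
    bank = nn.ModuleList([TDNN(n_acoustic, 1) for _ in range(n_articulatory)])

    pred_multi = multi(speech)
    pred_bank = torch.cat([m(speech) for m in bank], dim=-1)

    mse = nn.functional.mse_loss(pred_bank, lips)   # classic MSE criterion
    cc = correlation_loss(pred_bank, lips)          # cross-correlation criterion
    print(f"MSE loss: {mse.item():.4f}  CC loss: {cc.item():.4f}")
```

Either loss can drive the usual gradient-descent training loop; the abstract's finding is that the correlation-based criterion improves both convergence speed and estimation fidelity relative to plain MSE.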