Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Generating Intelligible Audio Speech From Visual Speech

Abstract

This paper is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are first proposed to estimate static spectral envelope features from active appearance model visual features. Two further methods are then developed to incorporate temporal information into the prediction: a feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimized through objective tests before applying subjective intelligibility tests, which determine a word accuracy of 85% from a set of human listeners on the GRID audio-visual speech database. This compares favorably with a previous regression-based baseline system, which achieved a word accuracy of 33%.
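To make the visual-to-acoustic mapping concrete, the sketch below is a minimal illustration (not the authors' implementation) of the model-level approach: a recurrent network regressing from active appearance model (AAM) visual features to static spectral envelope features, frame by frame. All dimensions, hyper-parameters, and the use of PyTorch are illustrative assumptions; the excitation generation and STRAIGHT synthesis stages are not shown.

    # Minimal sketch of an RNN mapping AAM visual features to spectral
    # envelope features. Dimensions and hyper-parameters are assumptions,
    # not values taken from the paper.
    import torch
    import torch.nn as nn

    class VisualToSpectral(nn.Module):
        def __init__(self, aam_dim=40, hidden_dim=128, envelope_dim=25):
            super().__init__()
            # Recurrent layer supplies the temporal context that a single
            # visual frame cannot provide on its own.
            self.rnn = nn.GRU(aam_dim, hidden_dim, batch_first=True)
            # Linear readout to static spectral envelope features per frame.
            self.out = nn.Linear(hidden_dim, envelope_dim)

        def forward(self, aam_seq):            # (batch, frames, aam_dim)
            h, _ = self.rnn(aam_seq)
            return self.out(h)                 # (batch, frames, envelope_dim)

    model = VisualToSpectral()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Dummy batch standing in for AAM features and target envelope features
    # extracted from a corpus such as GRID (e.g. 75 video frames per utterance).
    visual = torch.randn(8, 75, 40)
    target = torch.randn(8, 75, 25)

    optimizer.zero_grad()
    loss = loss_fn(model(visual), target)
    loss.backward()
    optimizer.step()

In the paper's pipeline, the predicted envelope features would then be combined with artificially generated fundamental frequency and aperiodicity inside the STRAIGHT vocoder to synthesize the final waveform; that stage is omitted here because STRAIGHT has no standard open Python interface.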
