Conference: International Conference on Text, Speech and Dialogue

Spatiotemporal Convolutional Features for Lipreading

Abstract

We propose a visual parametrization method for lipreading and audiovisual speech recognition from frontal face videos. The proposed features use learned spatiotemporal convolutions in a deep neural network trained to predict phonemes at the frame level. The network is trained on a manually transcribed, moderate-size dataset of Czech television broadcasts, but we show that the resulting features also generalize well to other languages. On the publicly available OuluVS dataset, we achieved 91% word accuracy using vanilla convolutional features and 97.2% after fine-tuning, substantial improvements over the state of the art on this popular benchmark. In contrast to most work on lipreading, we also demonstrate the usefulness of the proposed parametrization for continuous audiovisual speech recognition.
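To make the term "spatiotemporal convolution" concrete, the following is a minimal pure-Python sketch of a single 3D convolution sliding over the time, height, and width axes of a short video clip. It is an illustration only, not the authors' network: the abstract does not give kernel sizes, layer counts, or training details, so the function name `conv3d_valid`, the toy clip, and the hand-set `motion_kernel` are all assumptions. In the paper the kernels are learned, whereas here one is fixed by hand to show what such a filter can respond to.

```python
# Hypothetical sketch: one spatiotemporal (3D) convolution, stride 1, no padding.
# As in most deep learning frameworks, this is technically cross-correlation
# (the kernel is not flipped).

def conv3d_valid(clip, kernel):
    """Slide a (kt, kh, kw) kernel over a (T, H, W) clip given as nested lists."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over a small window spanning several frames and pixels.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip (assumed data): 3 frames of 2x2 pixels whose brightness grows by 1 per frame.
clip = [[[float(t)] * 2 for _ in range(2)] for t in range(3)]

# Hand-set kernel spanning 2 frames and 1 pixel: next frame minus current frame,
# so it responds to change over time (for example, a moving lip contour).
motion_kernel = [[[-1.0]], [[1.0]]]

features = conv3d_valid(clip, motion_kernel)
```

With this toy input, `features` has one 2x2 response map for each pair of consecutive frames. A learned network of the kind the abstract describes would stack many such filters and train them, together with the later layers, to predict a phoneme label for each frame.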