
Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning



Abstract

Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). Those predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). They appear to be efficient in a short temporal range (25–50 ms), predicting 50% to 75% of the variance of the incoming stimulus, which could result in potentially saving up to three-quarters of the processing power. Then they quickly decrease and almost vanish after 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
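The paper itself evaluates much richer feedforward and convolutional predictors on real speech features; as a rough, hypothetical sketch of the prediction-and-explained-variance setup the abstract describes, the following Python snippet trains a tiny feedforward network to predict the next feature frame from a few preceding ones and reports the fraction of variance explained on held-out frames. All shapes, hyperparameters, and the synthetic signal are illustrative assumptions, not the authors' configuration.

import numpy as np
import torch
import torch.nn as nn

np.random.seed(0)
torch.manual_seed(0)

# Hypothetical feature setup: 40-dimensional frames, 5 frames of past context.
N_FRAMES, N_MELS, CONTEXT = 5000, 40, 5

# Synthetic stand-in for speech features: a smoothed random walk, so that
# successive frames are partly predictable (as real speech frames are).
frames = np.cumsum(np.random.randn(N_FRAMES, N_MELS).astype(np.float32), axis=0)
frames = (frames - frames.mean(0)) / frames.std(0)

# Build (context, target) pairs: predict frame t from frames t-CONTEXT .. t-1.
X = np.stack([frames[i:i + CONTEXT].ravel() for i in range(N_FRAMES - CONTEXT)])
y = frames[CONTEXT:]
X, y = torch.from_numpy(X), torch.from_numpy(y)
n_train = int(0.8 * len(X))

# Small feedforward predictor (the paper mixes deeper feedforward and
# convolutional architectures; this is only a toy counterpart).
model = nn.Sequential(
    nn.Linear(CONTEXT * N_MELS, 256), nn.ReLU(),
    nn.Linear(256, N_MELS),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(200):  # full-batch training, enough for a toy example
    opt.zero_grad()
    loss = loss_fn(model(X[:n_train]), y[:n_train])
    loss.backward()
    opt.step()

# Explained variance on held-out frames: 1 - SSE / SST, the kind of
# "percentage of variance predicted" the abstract reports.
with torch.no_grad():
    pred = model(X[n_train:])
    sse = ((y[n_train:] - pred) ** 2).sum()
    sst = ((y[n_train:] - y[:n_train].mean(0)) ** 2).sum()
    print(f"explained variance: {(1 - sse / sst).item():.2f}")

In the paper's terms, running such a predictor at increasing horizons (25 ms, 50 ms, ... 250 ms) and comparing audio-only against audiovisual input would trace out the explained-variance curves summarized in the abstract.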

Bibliographic information

  • Source
    Neural Computation, 2020, Issue 3, pp. 596–625 (30 pages)
  • Author affiliations

    Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France;

    Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France, and Inria Grenoble-Rhône-Alpes, 38330 Montbonnot-Saint-Martin, France;

  • Indexed in: Science Citation Index (SCI); Chemical Abstracts (CA)
  • Format: PDF
  • Language: English
