Conference: European Conference on Computer Vision

Zero-Shot Keyword Spotting for Visual Speech Recognition In-the-wild


Abstract

Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a practical, real-world setting that has so far received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks that learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e., letters), we propose the use of a grapheme-to-phoneme encoder-decoder model that learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database for keywords unseen during training. We also show that our system outperforms a baseline that addresses KWS via automatic speech recognition (ASR), and that it drastically improves over other recently proposed ASR-free KWS methods.
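The data flow through components (b) and (c) can be illustrated with a minimal, untrained sketch. It is not the authors' implementation. It assumes that the visual front end (a) has already produced one feature vector per video frame (random vectors stand in for it here), that the keyword representation is the final state of a plain tanh RNN run over letter embeddings (the paper's encoder is additionally trained with a decoder to predict phonemes, which is omitted), and that per-frame outputs are max-pooled over time into one detection score. All names, dimensions, and the pooling choice are hypothetical.

```python
import math
import random

VISUAL_DIM, EMBED_DIM, KEYWORD_DIM, HIDDEN_DIM = 8, 4, 6, 5  # illustrative sizes


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(rng):
    """Untrained, randomly initialised weights for the toy model."""
    return {
        "embed": make_matrix(26, EMBED_DIM, rng),  # one vector per letter a-z
        "kw_in": make_matrix(KEYWORD_DIM, EMBED_DIM, rng),
        "kw_rec": make_matrix(KEYWORD_DIM, KEYWORD_DIM, rng),
        "det_in": make_matrix(HIDDEN_DIM, VISUAL_DIM + KEYWORD_DIM, rng),
        "det_rec": make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng),
        "out": [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_DIM)],
    }


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def run_rnn(inputs, w_in, w_rec):
    """Plain tanh RNN; returns the hidden state at every time step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def keyword_score(visual_feats, keyword, params):
    """Score in (0, 1) for a non-empty lowercase a-z keyword and a frame sequence."""
    # Stand-in for (b): encode the letters; the last state represents the keyword.
    letters = [params["embed"][ord(c) - ord("a")] for c in keyword]
    keyword_vec = run_rnn(letters, params["kw_in"], params["kw_rec"])[-1]
    # Stand-in for (c): append the keyword vector to every frame, then run an RNN.
    joint = [frame + keyword_vec for frame in visual_feats]
    states = run_rnn(joint, params["det_in"], params["det_rec"])
    # Assumed readout: one logit per frame, max-pooled over time, then a sigmoid.
    logits = [sum(w * h for w, h in zip(params["out"], state)) for state in states]
    return 1.0 / (1.0 + math.exp(-max(logits)))


rng = random.Random(0)
params = make_params(rng)
# Random vectors standing in for the output of the visual feature extractor (a).
frames = [[rng.uniform(-1, 1) for _ in range(VISUAL_DIM)] for _ in range(12)]
score = keyword_score(frames, "about", params)
print(f"toy detection score: {score:.3f}")
```

Because the keyword enters the detector as a vector computed from its spelling rather than as an index into a fixed vocabulary, the same function can be called with a word that never appeared in training, which is what makes the zero-shot setting possible. With random weights the score carries no meaning; the sketch only shows how the pieces connect.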
