Zero-Shot Keyword Spotting for Visual Speech Recognition In-the-wild


Abstract

Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e., letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), and that it drastically improves over other recently proposed ASR-free KWS methods.


Bibliographic details

Source
European Conference on Computer Vision | 2018 | pp. 536-552 | 17 pages
Authors
Themos Stafylakis; Georgios Tzimiropoulos
Original format: PDF
Keywords
Visual keyword spotting; Visual speech recognition; Zero-shot learning;


