Journal: Computer speech and language

Cross database audio visual speech adaptation for phonetic spoken term detection

Abstract

Spoken term detection (STD), the process of finding all occurrences of a specified search term in a large collection of speech segments, has many applications in multimedia search and information retrieval. It is known that video information in the form of lip movements can improve STD performance in the presence of audio noise. However, research in this direction has been hampered by the lack of large annotated audio-visual databases for development. We propose a novel approach for developing audio-visual spoken term detection when only a small (low-resource) audio-visual database is available. First, cross-database training is proposed as a novel framework based on the fused hidden Markov model (HMM) technique: an audio model is trained on large, publicly available audio databases and is then adapted to the visual data of the given audio-visual database. This approach is shown to perform better than the standard HMM joint-training method, and it also improves spoken term detection performance when used in the indexing stage. In a second approach, the external audio models are first adapted to the audio data of the given audio-visual database and then to its visual data. This approach also improves both phone recognition and spoken term detection accuracy. Finally, the cross-database training technique is used for HMM initialization, and an additional parameter re-estimation step is applied to the initialized models using the Baum-Welch algorithm. The proposed training approaches make it possible to benefit from both the large out-of-domain audio databases that are available and the small audio-visual database provided for development, yielding more accurate audio-visual models.
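
To make the indexing and search stages concrete, the sketch below shows one simple way a phonetic STD search could work once a recognizer has produced a phone index. The abstract does not specify the search algorithm, so everything here is an assumption for illustration: the index is taken to be a flat phone sequence, matching uses plain edit distance over fixed-length windows, and the names `edit_distance`, `detect_term`, and `phone_index` are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phone from a
                            curr[j - 1] + 1,      # insert a phone into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def detect_term(index: List[str], term: List[str],
                max_dist: int = 1) -> List[Tuple[int, int]]:
    """Return (start position, distance) for every window of the index
    whose edit distance to the search term is at most max_dist."""
    n = len(term)
    hits = []
    for start in range(len(index) - n + 1):
        d = edit_distance(index[start:start + n], term)
        if d <= max_dist:
            hits.append((start, d))
    return hits


# Hypothetical decoded phone stream for one utterance (assumed labels).
phone_index = ["sil", "k", "ae", "t", "s", "ae", "t", "sil"]
hits = detect_term(phone_index, ["k", "ae", "t"], max_dist=1)
print(hits)
```

Under these assumptions, allowing a small edit distance lets the search tolerate phone recognition errors, which is why more accurate audio-visual phone models in the indexing stage would be expected to improve detection.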
