Conference: Annual Conference of the International Speech Communication Association

Language Identification based on Generative Modeling of Posteriorgram Sequences Extracted from Frame-by-Frame DNNs and LSTM-RNNs

Abstract

This paper aims to enhance spoken language identification methods based on direct discriminative modeling of language labels using deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). In conventional methods, frame-by-frame DNNs or LSTM-RNNs are used for utterance-level classification. Although they offer strong frame-level classification performance and real-time efficiency, they are not optimized for variable-length utterance-level classification, since the classification is conducted by simply averaging frame-level prediction results. In addition, this simple classification methodology cannot fully exploit the combination of DNNs and LSTM-RNNs. To address these issues, we combine the frame-by-frame DNNs and LSTM-RNNs with a classifier based on a sequential generative model. In the proposed method, we regard posteriorgram sequences generated by a frame-by-frame classifier as feature sequences and model them for each language using language modeling technologies. Because the generative-model-based classifier does not model an identification boundary, it can flexibly handle variable-length utterances without losing the conventional advantages. Furthermore, the proposed method supports the combination of DNNs and LSTM-RNNs through joint posteriorgram sequences, since generative modeling can capture differences between the two posteriorgram sequences. Experiments conducted using the GlobalPhone database demonstrate the proposed method's effectiveness.
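The contrast the abstract draws between averaging frame-level predictions and scoring the whole posteriorgram sequence with one generative model per language can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not specify which language modeling technology is used, so the sketch assumes that each frame is quantized to the index of its largest posterior and that each language is modeled by an add-one-smoothed bigram model over those symbols. All names (`average_frame_decision`, `to_symbols`, `BigramModel`, `classify`) are hypothetical.

```python
import math

def average_frame_decision(posteriorgram):
    """Baseline from the abstract: average frame-level posteriors, pick the top class."""
    n_classes = len(posteriorgram[0])
    n_frames = len(posteriorgram)
    means = [sum(frame[k] for frame in posteriorgram) / n_frames
             for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: means[k])

def to_symbols(posteriorgram):
    """Assumption: quantize each frame to the index of its largest posterior."""
    return [max(range(len(frame)), key=lambda k: frame[k]) for frame in posteriorgram]

class BigramModel:
    """Assumed stand-in for a per-language sequence model (add-one smoothing)."""
    START = -1  # hypothetical start-of-utterance symbol

    def __init__(self, n_symbols):
        self.n_symbols = n_symbols
        self.counts = {}   # (previous symbol, symbol) -> count
        self.totals = {}   # previous symbol -> count

    def fit(self, sequences):
        for seq in sequences:
            prev = self.START
            for sym in seq:
                self.counts[(prev, sym)] = self.counts.get((prev, sym), 0) + 1
                self.totals[prev] = self.totals.get(prev, 0) + 1
                prev = sym
        return self

    def log_prob(self, seq):
        total = 0.0
        prev = self.START
        for sym in seq:
            num = self.counts.get((prev, sym), 0) + 1
            den = self.totals.get(prev, 0) + self.n_symbols
            total += math.log(num / den)
            prev = sym
        return total

def classify(posteriorgram, models):
    """Pick the language whose sequence model gives the highest log-likelihood."""
    seq = to_symbols(posteriorgram)
    return max(models, key=lambda lang: models[lang].log_prob(seq))

# Toy data: both "languages" use symbols 0 and 1 equally often, but in a different order.
models = {
    "lang_a": BigramModel(2).fit([[0, 1, 0, 1, 0, 1]]),
    "lang_b": BigramModel(2).fit([[0, 0, 0, 1, 1, 1]]),
}
test_utterance = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.1], [0.2, 0.8]]
print(classify(test_utterance, models))
```

In this toy setup the two training sequences have identical symbol frequencies, so only their ordering distinguishes them; a per-language sequence model can use that ordering, whereas averaging discards it. Because the score is a sum over frames, utterances of any length are handled by the same models.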