Journal: Quality Control, Transactions

Towards Understanding Attention-Based Speech Recognition Models

Abstract

Although attention-based speech recognition has achieved promising performance, the specific explanation of its intermediate representations remains a black box. In this paper, we visually show and explain the continuous encoder outputs. We propose a human-intervened forced alignment method to obtain labels for t-distributed stochastic neighbor embedding (t-SNE), and use them to better understand the attention mechanism and the recurrent representations. In addition, we combine t-SNE and canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model. Experiments are carried out on TIMIT and WSJ. The aligned embeddings of the encoder outputs form sequence manifolds of the ground-truth labels. Plots of the t-SNE embeddings visually show what representations the encoder has shaped and how the attention mechanism works for speech recognition. Comparisons between different models, different layers, and different utterance lengths show that the manifolds have clearer shapes when the outputs come from a deeper encoder layer, a shorter utterance, or a model with better performance. We also observe that the same symbols from different utterances tend to gather at similar positions, which confirms the consistency of our method. Further comparisons are made between different training epochs of the model using t-SNE and CCA. The results show that both plosive and nasal/flap phones converge quickly, while long vowel phones converge slowly.
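The labeling-and-projection step the abstract describes can be illustrated with a short sketch. The snippet below is a minimal illustration rather than the authors' implementation: it projects per-frame encoder outputs to two dimensions with t-SNE and colors each frame by its forced-alignment phone label. The arrays `encoder_outputs` and `phone_labels` are hypothetical placeholders for the model's actual intermediate representations and alignments.

```python
# Minimal sketch: t-SNE of frame-level encoder outputs, colored by
# forced-alignment phone labels. `encoder_outputs` and `phone_labels`
# are hypothetical stand-ins, not the paper's actual data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_frames, hidden_dim = 500, 256
encoder_outputs = rng.normal(size=(n_frames, hidden_dim))  # (frames, hidden_dim)
phone_labels = rng.integers(0, 5, size=n_frames)           # one phone id per frame

# Embed the frame-level representations into 2-D.
embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(encoder_outputs)

# One scatter per phone class so the legend maps colors to phone ids;
# in the paper's setting, well-trained encoders should form per-phone
# clusters and sequence manifolds here.
for phone in np.unique(phone_labels):
    mask = phone_labels == phone
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=5, label=f"phone {phone}")
plt.legend()
plt.title("t-SNE of encoder outputs, colored by forced-alignment labels")
plt.show()
```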
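For the training-dynamics analysis, CCA offers a way to quantify how much a phone's representation changes between checkpoints. The following hedged sketch computes canonical correlations between the same phone's frame representations taken from two epochs; a high mean correlation suggests the representation has stopped changing (converged). `reps_epoch_a` and `reps_epoch_b` are hypothetical (frames, hidden_dim) arrays, and the paper's exact procedure may differ.

```python
# Hedged sketch: compare one phone's representations across two training
# epochs with CCA. The input arrays are hypothetical placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n_frames, hidden_dim = 200, 64
reps_epoch_a = rng.normal(size=(n_frames, hidden_dim))
# Simulate a later epoch whose representations changed only slightly.
reps_epoch_b = reps_epoch_a + 0.1 * rng.normal(size=(n_frames, hidden_dim))

n_components = 10
cca = CCA(n_components=n_components)
scores_a, scores_b = cca.fit_transform(reps_epoch_a, reps_epoch_b)

# Correlation of each pair of canonical variates; their mean serves as a
# similarity score between the two epochs' representations.
corrs = [np.corrcoef(scores_a[:, i], scores_b[:, i])[0, 1]
         for i in range(n_components)]
print(f"mean canonical correlation: {np.mean(corrs):.3f}")
```

Tracking this score epoch by epoch, per phone class, is one way to observe the convergence ordering the abstract reports (plosives and nasal/flap phones early, long vowels late).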
