IEEE Journal of Selected Topics in Signal Processing

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Abstract

Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder-decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
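The two combinations described above, the multiobjective training loss and the joint decoding score, are both weighted interpolations of the CTC and attention branches. The following Python sketch illustrates the idea only; the function and weight names (multiobjective_loss, joint_decoding_score, lambda_mtl, lambda_dec) are hypothetical and not taken from the paper, and the weights would be tuned per task.

# Minimal sketch, assuming scalar losses / log-probabilities have already been
# computed by the shared-encoder CTC and attention branches.

def multiobjective_loss(ctc_loss, attention_loss, lambda_mtl=0.5):
    """Training: interpolate the CTC loss with the attention decoder loss."""
    return lambda_mtl * ctc_loss + (1.0 - lambda_mtl) * attention_loss

def joint_decoding_score(log_p_ctc, log_p_att, lambda_dec=0.3):
    """Decoding: score a partial hypothesis in the one-pass beam search by
    combining the CTC prefix log-probability with the attention decoder's
    log-probability."""
    return lambda_dec * log_p_ctc + (1.0 - lambda_dec) * log_p_att

During beam search, hypotheses are ranked by the combined score, so a hypothesis that only the attention decoder favors (for example, one that skips or repeats a segment of the input) is penalized by its low CTC prefix probability; this is how irregular alignments are suppressed.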
