Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Unlike conventional attention models, in which the soft alignment is obtained by a pass over the entire input sequence, attention models for online recognition must learn an online alignment that attends to part of the input sequence monotonically when generating output symbols. Based on the fact that every output symbol corresponds to a segment of the input sequence, we propose a new attention mechanism for learning online alignment by decomposing the conventional alignment into two parts: segmentation (segment boundary detection with a hard decision) and segment-directed attention (information aggregation within the segment with soft attention). Boundary detection is conducted along the time axis from left to right, and a decision is made for each input frame about whether it is a segment boundary. When a boundary is detected, the decoder generates an output symbol by attending to the inputs within the corresponding segment. With the proposed attention mechanism, online speech recognition can be realized. Experimental results on the TIMIT and WSJ datasets show that the proposed attention mechanism achieves online performance comparable to that of state-of-the-art models.
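The two-part decomposition described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the function names, the fixed `threshold`, the dot-product scoring, and the use of a single fixed `query` vector (standing in for the decoder state, which would normally change with each output symbol) are all assumptions made for illustration, and the per-frame boundary probabilities are assumed to come from some separately trained detector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_segment_attention(frames, boundary_probs, query, threshold=0.5):
    """Scan frames left to right; on each detected boundary, soft-attend
    over the frames of the segment that just ended.

    frames:         list of equal-length feature vectors (encoder outputs)
    boundary_probs: per-frame boundary probability (hypothetical detector output)
    query:          fixed vector standing in for the decoder state (assumption)
    threshold:      hard-decision cutoff (assumption)

    Returns a list of (start, end, context) tuples, one per detected segment.
    Frames after the last detected boundary are not emitted.
    """
    contexts = []
    start = 0
    for t, p in enumerate(boundary_probs):
        if p > threshold:  # hard decision: frame t closes a segment
            segment = frames[start:t + 1]
            # Soft attention restricted to the current segment.
            scores = [sum(q * x for q, x in zip(query, f)) for f in segment]
            weights = softmax(scores)
            dim = len(segment[0])
            context = [
                sum(w * f[d] for w, f in zip(weights, segment))
                for d in range(dim)
            ]
            contexts.append((start, t, context))
            start = t + 1  # next segment begins after the boundary
    return contexts


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    boundary_probs = [0.1, 0.9, 0.2, 0.8]
    for start, end, context in online_segment_attention(
        frames, boundary_probs, query=[0.0, 0.0]
    ):
        print(start, end, context)
```

Because each context vector depends only on frames up to the detected boundary, an output can be produced as soon as that boundary is seen, which is what makes the alignment usable online.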