
The use of discrete distributions with a very large codebook for automatic speech recognition and speaker verification.



Abstract

With the advance of semiconductor technology and the popularity of the distributed speech/speaker recognition paradigm (e.g., Siri on the iPhone 4S), we revisit the use of discrete models in automatic speech recognition (ASR) and speaker verification (SV) tasks. Compared with the dominant continuous-density model, a discrete model has inherently attractive properties: it uses non-parametric output distributions from which a probability value can be retrieved in O(1) time, and its features can be encoded in fewer bits than those of a continuous model, lowering the bandwidth requirement in a distributed speech/speaker recognition architecture. Unfortunately, the recognition performance of a conventional discrete model is significantly worse than that of a continuous one, owing to large quantization error and the use of multiple independent streams.

In this thesis, we propose to reduce the quantization error of a discrete model by using a very large codebook with tens of thousands of codewords. The codebook of the proposed model is about a hundred times larger than that of a conventional discrete model, whose codebook size usually ranges from 256 to 1024; accordingly, the number of parameters needed to specify a discrete output distribution grows a hundredfold. Compared with a discrete model of conventionally sized codebook, building a very large codebook model poses two major challenges. First, given a continuous acoustic feature vector, how do we quickly find its corresponding codeword in a codebook a hundred times larger? Second, given a limited amount of training data, how do we robustly train such a high-density model, which has a hundred times more parameters than the conventional one?

To find the codeword for an acoustic vector quickly, we employ subvector-quantized (SVQ) codebooks.
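The O(1) scoring property claimed above can be illustrated with a minimal sketch. The code below is hypothetical and not from the thesis: it contrasts a discrete output distribution (a table indexed by codeword id, so one array access per frame) with a diagonal-covariance Gaussian mixture, whose per-frame cost grows with the number of components and the feature dimension. The sizes (50,000 codewords, 39 dimensions, 16 mixtures) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete state: the output distribution is just a normalized table,
# so scoring a quantized frame is a single O(1) lookup.
CODEBOOK_SIZE = 50_000            # "tens of thousands of codewords"
table = rng.random(CODEBOOK_SIZE)
table /= table.sum()

def discrete_log_prob(codeword_id: int) -> float:
    """O(1) per frame: one table lookup, no exponentials."""
    return float(np.log(table[codeword_id]))

# Continuous-density state: every mixture component must be evaluated
# for every frame (diagonal covariances assumed here).
DIM, MIX = 39, 16
means = rng.standard_normal((MIX, DIM))
inv_vars = np.ones((MIX, DIM))
log_weights = np.full(MIX, -np.log(MIX))

def gmm_log_prob(x: np.ndarray) -> float:
    """O(MIX * DIM) per frame: the cost the discrete model avoids."""
    diff = x - means                                   # (MIX, DIM)
    log_comp = (log_weights
                - 0.5 * np.sum(diff * diff * inv_vars, axis=1)
                - 0.5 * DIM * np.log(2 * np.pi))
    return float(np.logaddexp.reduce(log_comp))
```

The same contrast also explains the decoding speedups reported below: the table lookup replaces the mixture evaluation in the innermost loop of the decoder.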
SVQ codebooks represent a very large codebook in the full feature space as a combinatorial product of smaller per-subvector codebooks; finding a full-space codeword thus reduces to finding a set of SVQ codewords, which is very fast.

To robustly train such a high-density model, two techniques are explored. The first is model conversion: the discrete model is converted directly from a well-trained continuous model, avoiding direct training on the data. The second is subspace modeling: the original high-density discrete distribution table is treated as a high-dimensional vector assumed to lie in a low-dimensional subspace. This subspace representation reduces the number of free parameters by tens to hundreds of fold, so the model can be trained robustly on a limited amount of data.

Experimental evaluations on both ASR and SV tasks show the feasibility and benefits of the very-large-codebook discrete model. On the WSJ0 (Wall Street Journal) ASR task, the proposed model achieves recognition accuracy comparable to the continuous model, with much faster decoding and a lower bandwidth requirement. On the NIST (National Institute of Standards and Technology) 2002 SV task, a speedup of 8 to 25 fold is achieved with almost no loss in verification performance.
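The SVQ lookup described above can be sketched as follows. This is an illustrative example, not the thesis's implementation: it assumes a 39-dimensional feature split into 13 subvectors of 3 dimensions, each with its own 32-word codebook, so the implied full-space codebook has 32^13 codewords, yet quantizing a frame costs only 13 small nearest-neighbor searches.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_SUB, SUB_DIM, SUB_WORDS = 13, 3, 32   # assumed split; 32**13 implied codewords
sub_codebooks = rng.standard_normal((NUM_SUB, SUB_WORDS, SUB_DIM))

def svq_encode(x: np.ndarray) -> tuple[int, ...]:
    """Quantize one feature vector: nearest codeword per subvector.

    The tuple of per-subvector indices identifies one codeword of the
    combinatorial full-space codebook."""
    ids = []
    for s in range(NUM_SUB):
        sub = x[s * SUB_DIM:(s + 1) * SUB_DIM]
        d2 = np.sum((sub_codebooks[s] - sub) ** 2, axis=1)  # (SUB_WORDS,)
        ids.append(int(np.argmin(d2)))
    return tuple(ids)

x = rng.standard_normal(NUM_SUB * SUB_DIM)
codeword = svq_encode(x)
# Each index fits in 5 bits (32 words), so the frame is transmitted in
# 13 * 5 = 65 bits, illustrating the lower-bandwidth claim for the
# distributed recognition setting.
```

Under these assumptions, the search cost is NUM_SUB * SUB_WORDS distance computations per frame, independent of the size of the implied full-space codebook.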

Record details

  • Author

    Ye, Guoli

  • Author affiliation

    Hong Kong University of Science and Technology (Hong Kong)

  • Degree-granting institution: Hong Kong University of Science and Technology (Hong Kong)
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 141 p.
  • Total pages: 141
  • Format: PDF
  • Language: eng
