Journal: Neural Computing & Applications

Nonlinear normalization of input patterns to speaker variability in speech recognition neural networks

Abstract

Input variability caused by speaker changes is one of the most crucial factors limiting the effectiveness of speech recognition systems. One solution is to adapt or normalize the input before recognition, so that all parameters of the input representation are mapped to those of a single speaker. This paper proposes three such methods, each of which compensates for some of the effects of speaker changes on the recognition process. In all three methods, a feed-forward neural network is first trained to map the input to codes representing the phonetic classes and the speakers. Then, among the 71 speakers used in training, the one with the highest phone recognition accuracy is selected as the reference speaker, and the representation parameters of the other speakers are converted to those of the corresponding speech uttered by the reference speaker. In the first method, the error back-propagation algorithm is used to find, in the input space, the optimal point of the decision region for each phone of each speaker. The distances between these points and the corresponding points of the reference speaker are used to offset the effects of speaker changes and to adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm with the reference speaker's data held as the desired speaker output, all speech signal frames in both the training and test datasets are corrected so that they coincide with the corresponding speech of the reference speaker. In the third method, a second feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation. The phonetic output of the direct network, together with the reference speaker data, is fed to the inverse network, which then yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. When the output of the final network is combined with that of the unadapted network on the basis of the highest confidence level, the three methods increase phone recognition accuracy on clean speech by 2.1%, 2.6%, and 3%, respectively.
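
As a rough illustration of three steps named in the abstract (choosing the reference speaker, the offset idea behind the first method, and the confidence-based combination), the following is a minimal sketch and not the authors' implementation. Several things are assumptions made for the example: the per-speaker, per-phone "optimal points" are taken as already computed and stored in a dictionary, the offset is applied as a simple vector difference, the phone label of a frame is taken as known, and "confidence" is taken to be the largest output activation of each network. All function and variable names are hypothetical.

```python
import numpy as np


def select_reference_speaker(accuracy_by_speaker):
    """Return the speaker whose phone recognition accuracy is highest."""
    return max(accuracy_by_speaker, key=accuracy_by_speaker.get)


def adapt_frame(frame, phone, speaker, ref_speaker, optimal_points):
    """Shift one input frame toward the reference speaker.

    Assumption: optimal_points maps (speaker, phone) to a vector in the
    input space, and the adaptation is a plain additive offset.
    """
    offset = optimal_points[(ref_speaker, phone)] - optimal_points[(speaker, phone)]
    return frame + offset


def combine_by_confidence(adapted_probs, unadapted_probs):
    """Per frame, keep the phone chosen by the more confident network.

    Assumption: both arrays have shape (frames, phones) and confidence
    is the maximum output activation in each row.
    """
    use_adapted = adapted_probs.max(axis=1) >= unadapted_probs.max(axis=1)
    return np.where(
        use_adapted,
        adapted_probs.argmax(axis=1),
        unadapted_probs.argmax(axis=1),
    )
```

In this sketch the second and third methods are not covered, since both depend on network details (gradient-based correction of the input frames and a trained inverse network) that the abstract does not specify.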
