Joint IEEE International Conference on Development and Learning and Epigenetic Robotics

Unsupervised learning for spoken word production based on simultaneous word and phoneme discovery without transcribed data



Abstract

A computational model that can reproduce the process by which human children acquire language, including word discovery and word production, is crucially important for understanding human development. Such a model should not depend on transcribed data, which researchers typically provide manually when training automatic speech recognition and speech synthesis systems. One of the main differences between speech recognition and production by human infants and by conventional computer systems is access to transcribed data, i.e., supervised learning with transcribed data versus unsupervised learning without it. This study proposes an unsupervised machine learning method for spoken word production that uses no transcribed data: the entire system is trained purely on speech signals that the system (the robot) can obtain from its auditory sensor, e.g., a microphone. The method combines the nonparametric Bayesian double articulation analyzer (NPB-DAA), an unsupervised machine learning method that enables a robot to identify word-like and phoneme-like linguistic units from speech signals alone, with a hidden Markov model-based (HMM-based) statistical speech synthesis method of the kind widely used to build text-to-speech (TTS) systems. The latent letters (phoneme-like units) and latent words (word-like units) discovered by the NPB-DAA are used to train the HMM-based TTS system. We present two experiments, one using Japanese vowel sequences and one using an English spoken digit corpus. Both experimental results showed that the proposed method can produce many spoken words that can be recognized as the original words provided by the human speakers. Furthermore, we discuss future challenges in creating a robot that can autonomously learn a phoneme system and vocabulary solely from sensorimotor information.
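The two-stage pipeline described above — discover phoneme-like and word-like units from unlabeled signals, then reuse the discovered labels for production — can be illustrated with a deliberately simplified sketch. Here plain k-means stands in for the NPB-DAA's nonparametric Bayesian unit discovery, and replaying the mean feature of each discovered unit stands in for HMM-based statistical synthesis; the simulated two-dimensional "frames" and all names are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of the discover-then-produce pipeline (illustrative only):
# k-means replaces the NPB-DAA's Bayesian segmentation, and replaying
# per-unit mean features replaces HMM-based speech synthesis.
import math
import random

random.seed(0)

def kmeans(frames, init_idx, iters=20):
    """Plain k-means over feature frames; returns per-frame labels and centers."""
    centers = [list(frames[i]) for i in init_idx]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assign each frame to its nearest phoneme-like cluster.
        labels = [min(range(len(centers)),
                      key=lambda j, f=f: math.dist(f, centers[j]))
                  for f in frames]
        # Re-estimate each cluster center as the mean of its members.
        for j in range(len(centers)):
            members = [f for f, l in zip(frames, labels) if l == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

def collapse(labels):
    """Merge consecutive repeats: frame labels -> phoneme-like unit sequence."""
    seq = [labels[0]]
    for l in labels[1:]:
        if l != seq[-1]:
            seq.append(l)
    return seq

# Simulated unlabeled "speech": an utterance built from three acoustic
# states (stand-ins for vowel-like segments), plus small noise.
frames = [(0.0, 0.0)] * 10 + [(1.0, 1.0)] * 20 + [(2.0, 2.0)] * 10
frames = [(x + random.gauss(0, 0.05), y + random.gauss(0, 0.05))
          for x, y in frames]

labels, centers = kmeans(frames, init_idx=[0, 15, 35])
phoneme_seq = collapse(labels)               # discovered unit sequence
produced = [centers[p] for p in phoneme_seq]  # "spoken word production"
print(phoneme_seq)
```

The key property the sketch shares with the paper's method is that no transcription enters the loop: the labels used for "production" are the system's own discovered units, not human-provided phonemes or words.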


