Effect of acoustic and linguistic contexts on human and machine speech recognition

Abstract

We compared the performance of an automatic speech recognition system using n-gram language models, HMM acoustic models, and combinations of the two with the word recognition performance of human subjects who had access to acoustic information only, to local linguistic context only, or to a combination of both. All speech recordings were taken from Japanese narration and spontaneous speech corpora. Humans have difficulty recognizing isolated words out of context, especially words drawn from spontaneous speech, partly because of word-boundary coarticulation. Human recognition performance improves dramatically when one or two preceding words are provided. Short words in Japanese mainly consist of post-positional particles (e.g. wa, ga, wo, ni), function words that immediately follow content words such as nouns and verbs. The predictability of such short words is therefore very high given the one or two preceding words, so their recognition improves drastically. Providing even more context further improves human prediction performance under text-only conditions (without acoustic signals); it also improves speech recognition, but the improvement is relatively small. Recognition experiments with an automatic speech recognizer were conducted under conditions almost identical to those of the human experiments. The performance of the acoustic models without any language model, or with only a unigram language model, was greatly inferior to human recognition performance with no context. In contrast, prediction performance using a trigram language model was comparable or superior to human performance when given one preceding and one succeeding word. These results suggest that, to make automatic speech recognizers comparable to humans under conditions of limited linguistic context, we must improve acoustic models rather than language models.
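The abstract's central point about particle predictability can be made concrete with a small worked example. The sketch below is not from the paper; the toy corpus and function names are hypothetical. It estimates maximum-likelihood trigram probabilities from a few segmented romanized-Japanese sentences and shows how strongly two preceding words constrain a short particle such as wo:

```python
from collections import Counter

# Toy segmented corpus (romanized Japanese); the study itself used
# narration and spontaneous-speech corpora, not this hypothetical data.
corpus = [
    ["watashi", "wa", "hon", "wo", "yomu"],
    ["kare", "wa", "gakkou", "ni", "iku"],
    ["watashi", "wa", "eiga", "wo", "miru"],
]

trigrams = Counter()
bigrams = Counter()
for sent in corpus:
    padded = ["<s>", "<s>"] + sent
    for i in range(2, len(padded)):
        trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1
        bigrams[(padded[i - 2], padded[i - 1])] += 1

def trigram_prob(w, w1, w2):
    """Maximum-likelihood estimate of P(w | w1 w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w)] / bigrams[(w1, w2)]

# With two preceding words available, the object particle "wo" is
# almost fully determined after a noun like "hon" (book):
print(trigram_prob("wo", "wa", "hon"))  # 1.0 in this toy corpus
print(trigram_prob("ni", "wa", "hon"))  # 0.0
```

This mirrors the finding reported above: one or two preceding words make post-positional particles nearly deterministic, which is why a trigram language model closes much of the gap with human listeners on short function words.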

Bibliographic information

  • Source
    Computer speech and language | 2014, No. 3 | pp. 769-787 | 19 pages
  • Author affiliations

    Department of Media Science, Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8603, Japan;

    Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan;

    Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan;

  • Indexing: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English (eng)
  • Keywords

    Continuous speech recognition; Human speech recognition ability; Acoustic model; Language model;
