Robust Language Modeling for a Small Corpus of Target Tasks Using Class-Combined Word Statistics and Selective Use of a General Corpus

Abstract

To improve the accuracy of language models in speech recognition tasks for which a large text corpus for language model training is difficult to collect, we propose a class-combined bigram and the selective use of general text. In the class-combined bigram, the word bigram and the class bigram are combined using weights expressed as functions of the preceding word frequency and the succeeding word-type count. An experiment showed that the accuracy of the proposed class-combined bigram is equivalent to that of a word bigram trained on a text corpus approximately three times larger. In the selective use of general text, the language model was corrected by automatically selecting, from a large volume of text collected without specifying the task, sentences that were expected to improve accuracy, and by adding these sentences to the small corpus of the target task. An experiment showed that the recognition error rate was reduced by up to 12% compared to the case in which text was not selected. Lastly, a model that uses both the class-combined bigram and text addition yielded further gains: approximately 34% in adjusted perplexity and approximately 31% in recognition error rate, compared to the word bigram created from the target task text only.
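
The abstract does not give the weighting function itself, so the following is only a minimal sketch of how a class-combined bigram could be interpolated. It assumes a Witten-Bell-style weight λ(w1) = c(w1) / (c(w1) + r(w1)), where c(w1) is the frequency of the preceding word and r(w1) is the number of distinct word types observed after it; this form, the hard word-to-class mapping, and the names `train_counts` and `combined_prob` are illustrative assumptions, not the authors' formulation.

```python
from collections import Counter, defaultdict


def train_counts(sentences, word2class):
    """Collect word-level and class-level counts from tokenized sentences."""
    c = {
        "hist": Counter(),          # c(w1): times w1 occurs as a bigram history
        "bi": Counter(),            # c(w1, w2)
        "succ": defaultdict(set),   # distinct word types seen after w1
        "chist": Counter(),         # c(g1): class-level history count
        "cbi": Counter(),           # c(g1, g2)
        "word": Counter(),          # c(w) over all tokens
        "cls": Counter(),           # c(g) over all tokens
    }
    for sent in sentences:
        for w in sent:
            c["word"][w] += 1
            c["cls"][word2class[w]] += 1
        for w1, w2 in zip(sent, sent[1:]):
            g1, g2 = word2class[w1], word2class[w2]
            c["hist"][w1] += 1
            c["bi"][(w1, w2)] += 1
            c["succ"][w1].add(w2)
            c["chist"][g1] += 1
            c["cbi"][(g1, g2)] += 1
    return c


def combined_prob(w1, w2, c, word2class):
    """P(w2 | w1) as a weighted mix of the word bigram and the class bigram."""
    g1, g2 = word2class[w1], word2class[w2]
    n = c["hist"][w1]                   # preceding word frequency
    r = len(c["succ"].get(w1, ()))      # succeeding word-type count

    # Word bigram: relative frequency of w2 after w1.
    p_word = c["bi"][(w1, w2)] / n if n else 0.0

    # Class bigram: P(g2 | g1) * P(w2 | g2).
    p_trans = c["cbi"][(g1, g2)] / c["chist"][g1] if c["chist"][g1] else 0.0
    p_emit = c["word"][w2] / c["cls"][g2] if c["cls"][g2] else 0.0
    p_class = p_trans * p_emit

    # Assumed weight: trust the word bigram more when w1 is frequent
    # relative to the variety of words that follow it.
    lam = n / (n + r) if n else 0.0
    return lam * p_word + (1.0 - lam) * p_class


# Toy data (hypothetical): "like coffee" never occurs in training.
word2class = {"i": "PRON", "you": "PRON", "want": "VERB", "like": "VERB",
              "tea": "NOUN", "coffee": "NOUN"}
sentences = [["i", "want", "tea"], ["i", "want", "coffee"], ["you", "like", "tea"]]
counts = train_counts(sentences, word2class)
print(combined_prob("like", "coffee", counts, word2class))  # nonzero despite unseen bigram
```

In this toy corpus the pair "like coffee" is unseen, so the word bigram alone would assign it zero probability; the class bigram (VERB followed by NOUN) supplies the mass instead. That is the kind of robustness on small corpora the abstract describes, though the weights used in the paper may differ.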