A Novel Approach to Improve the Mongolian Language Model Using Intermediate Characters

Abstract

In Mongolian, many words share the same presentation form but are distinct words with different character codes. Because typists usually enter words according to their presentation forms and sometimes cannot distinguish the underlying codes, Mongolian corpora contain many coding errors, which makes statistics and retrieval on such corpora very difficult. To solve this problem, this paper proposes a method that merges words with the same presentation form by means of intermediate characters and then uses the corpus in intermediate-character form to build a Mongolian language model. Experimental results show that, compared with a model trained on the unprocessed corpus, the proposed method reduces the perplexity and the word error rate of the 3-gram language model by 41% and 30%, respectively. The proposed approach significantly improves the performance of the Mongolian language model and greatly enhances the accuracy of Mongolian speech recognition.
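
The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `INTERMEDIATE` table, the placeholder symbols `O` and `V`, and the helper names `to_intermediate` and `merge_counts` are hypothetical, and the sketch assumes for illustration that the letter pairs O/U and OE/UE are among those that can share a presentation form. The actual intermediate-character table is not given in the abstract.

```python
from collections import Counter

# Hypothetical mapping (an assumption, not the paper's table): letters that
# are assumed to share a presentation form map to one intermediate character.
INTERMEDIATE = {
    "\u1823": "O",  # MONGOLIAN LETTER O
    "\u1824": "O",  # MONGOLIAN LETTER U
    "\u1825": "V",  # MONGOLIAN LETTER OE
    "\u1826": "V",  # MONGOLIAN LETTER UE
}


def to_intermediate(word: str) -> str:
    """Rewrite a word in intermediate-character form, one character at a time."""
    return "".join(INTERMEDIATE.get(ch, ch) for ch in word)


def merge_counts(tokens):
    """Count tokens after normalization, so differently coded variants merge."""
    return Counter(to_intermediate(token) for token in tokens)


# Two codings that differ only in O versus U fall into the same bucket.
corpus = ["\u1823\u1837", "\u1824\u1837", "\u1820\u1837"]
counts = merge_counts(corpus)
```

Counts merged in this way could then be passed to any standard n-gram toolkit, so that variants differing only in their character codes contribute to the same n-gram statistics.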