International Conference on Document Analysis and Recognition

Sub-Word Based Mongolian Offline Handwriting Recognition

Abstract

Mongolian is an agglutinative language, so a large number of words are derived from the same stem by attaching different suffixes. This morphological richness leads to high out-of-vocabulary (OOV) rates and to data sparsity. The recognition system proposed in this paper has three parts: handwritten image preprocessing, mapping of images to grapheme sequences, and sub-word-based language model (LM) decoding. We present a sub-word-based n-gram LM to address the high OOV rate. Based on the characteristics of Mongolian, we modified the traditional token passing algorithm to improve decoding speed and to make it easy to combine with any n-gram LM. We evaluated sub-words at different levels on the open Mongolian offline handwriting dataset (MHW). The bi-syllable 2-gram LM performed best, with word error rates (WERs) of 18.32% and 23.22% on the two test sets. Our experiments show that this method predicts in-vocabulary words with higher accuracy and can also predict OOV words with a certain degree of accuracy.
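To illustrate why a sub-word n-gram LM helps with OOV words, here is a minimal sketch of a sub-word bigram model. It is not the authors' implementation: the class name `SubwordBigramLM`, the add-one smoothing, the boundary markers, and the placeholder units (`stem_a`, `suf_x`, and so on) are all assumptions made for illustration, and the words are supplied already segmented rather than split into real Mongolian syllables.

```python
import math
from collections import Counter

# Hypothetical word-boundary markers (an assumption, not from the paper).
BOS, EOS = "<w>", "</w>"


class SubwordBigramLM:
    """Toy sub-word bigram LM with add-one smoothing (illustrative only)."""

    def __init__(self, segmented_words):
        self.context_counts = Counter()  # how often a unit appears as a left context
        self.bigram_counts = Counter()   # counts of (previous unit, current unit)
        for units in segmented_words:
            seq = [BOS, *units, EOS]
            self.context_counts.update(seq[:-1])
            self.bigram_counts.update(zip(seq, seq[1:]))
        # Possible "next" symbols: every sub-word unit plus the end marker.
        self.vocab = {u for units in segmented_words for u in units} | {EOS}

    def log_prob(self, units):
        """Natural-log probability of one word given as a list of sub-word units."""
        seq = [BOS, *units, EOS]
        v = len(self.vocab)
        total = 0.0
        for prev, cur in zip(seq, seq[1:]):
            num = self.bigram_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + v
            total += math.log(num / den)
        return total


# Placeholder training words: stems combined with suffixes.
train = [["stem_a", "suf_x"], ["stem_a", "suf_y"], ["stem_b", "suf_x"]]
lm = SubwordBigramLM(train)

# This stem+suffix combination never occurs as a whole word in training,
# so a word-level vocabulary would treat it as OOV.
oov_word = ["stem_b", "suf_y"]
seen_score = lm.log_prob(train[0])
oov_score = lm.log_prob(oov_word)
```

Because both units of `oov_word` were observed in other words, the model assigns it a finite score, although a lower one than the word seen in training. A word-level LM would have no entry for it at all.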
