Unsupervised language model adaptation for handwritten Chinese text recognition


Abstract

This paper presents an effective approach to unsupervised language model adaptation (LMA) using multiple models in offline recognition of unconstrained handwritten Chinese texts. Since the domain of a document to be recognized is variable and usually unknown a priori, we use a two-pass recognition strategy with a pre-defined set of multi-domain language models. We propose three methods to dynamically generate an adaptive language model matching the first-pass recognition output: model selection, model combination, and model reconstruction. In model selection, we choose the language model with minimum perplexity on the first-pass recognized text. In model combination, we learn the combination weights by minimizing the sum of squared errors with both L2-norm and L1-norm regularization. In model reconstruction, we reconstruct a language model from a group of orthogonal bases, with the coefficients learned to match the document to be recognized. Moreover, we reduce the storage size of the multiple language models using two compression methods: split vector quantization (SVQ) and principal component analysis (PCA). Comprehensive experiments on two public Chinese handwriting databases, CASIA-HWDB and HIT-MW, show that the proposed unsupervised LMA approach improves recognition performance markedly, particularly for documents in the ancient domain, where recognition accuracy improves by 7 percent. Meanwhile, combining the two compression methods greatly reduces the storage size of the language models with little loss of recognition accuracy.
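The model-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses toy unigram models with hypothetical vocabularies and probabilities in place of the paper's domain-specific n-gram models, and a simple probability floor for out-of-vocabulary words.

```python
import math

def perplexity(model, text, floor=1e-8):
    """Perplexity of a unigram model (word -> probability) on a
    token list; lower perplexity means a better domain match."""
    log_prob = 0.0
    for word in text:
        # Unseen words get a small floor probability (hypothetical choice).
        log_prob += math.log(model.get(word, floor))
    return math.exp(-log_prob / len(text))

def select_model(models, first_pass_text):
    """Model selection: pick the domain LM with minimum perplexity
    on the first-pass recognition output."""
    return min(models, key=lambda name: perplexity(models[name], first_pass_text))
```

For example, given one "news" and one "ancient" toy model, a first-pass output drawn from news vocabulary selects the news model.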
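The model-combination step with L2-norm regularization admits a closed-form sketch under simplifying assumptions: each column of a matrix M holds one domain LM's probabilities over a shared event list, and p is the empirical distribution of the first-pass recognized text. The names and the renormalization step here are illustrative; the paper's L1-norm variant needs an iterative solver and is omitted.

```python
import numpy as np

def combination_weights(M, p, lam=0.1):
    """L2-regularized least squares (ridge) in closed form:
    w = (M^T M + lam*I)^{-1} M^T p."""
    k = M.shape[1]
    return np.linalg.solve(M.T @ M + lam * np.eye(k), M.T @ p)

def combined_model(M, w):
    """Adaptive LM as a weighted combination of the domain models,
    clipped to be nonnegative and renormalized to a distribution."""
    q = np.clip(M @ w, 0.0, None)
    return q / q.sum()
```

With lam=0 and a target distribution equal to one of the columns, the learned weight vector recovers that column exactly.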
