Conference: IAPR International Conference on Document Analysis and Recognition

Error Detection and Corrections in Indic OCR Using LSTMs

Abstract

Conventional approaches to spell checking suggest corrections using proximity-based matches to a known vocabulary. For highly inflectional Indian languages, any off-the-shelf vocabulary is significantly incomplete, since a large fraction of words in Indic documents are generated using word conjoining rules. Tremendous manual effort is therefore needed to spell-correct words in Indic OCR documents. Moreover, in a spell checking system, a vocabulary may suggest multiple alternatives for an incorrect word. The ranking of these suggestions is improved using language models; owing to corpus resource scarcity, however, Indian languages lack reliable language models. Thus, learning the character (or n-gram) confusions, or error patterns, of the OCR system can help correct out-of-vocabulary (OOV) words in OCR documents. We adopt a Long Short-Term Memory (LSTM) based character-level language model with a fixed delay for discriminative language modeling in the context of OCR errors, jointly addressing the problems of error detection and correction in Indic OCR. For words in the OCR output that need no correction, our model simply abstains from suggesting any changes. We present extensive results to validate the performance of our model on four Indian languages with different inflectional complexities. We achieve F-scores above 92.4% and decreases in Word Error Rate (WER) of at least 26.7% across the four languages.
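The conventional baseline that the abstract contrasts against, proximity-based matching to a known vocabulary, can be sketched as follows. This is a minimal illustration and not the paper's LSTM method: it assumes Levenshtein edit distance as the proximity measure, and the function names, the `max_distance` threshold, and the toy English vocabulary are all hypothetical stand-ins for Indic data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=1):
    """Return vocabulary entries within max_distance edits, nearest first."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_distance]


# Hypothetical toy vocabulary (assumption; the paper targets Indic scripts).
VOCAB = ["form", "farm", "firm", "sun", "flower"]

# Several candidates tie at distance 1, so ranking needs extra information.
print(suggest("fcrm", VOCAB))
# A conjoined word absent from the vocabulary yields no candidates at all.
print(suggest("sunflowor", VOCAB))
```

The two calls illustrate the two weaknesses the abstract names: tied suggestions that would need a language model to rank, and conjoined out-of-vocabulary words for which a fixed vocabulary offers nothing.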
