Error Detection and Corrections in Indic OCR Using LSTMs


Abstract

Conventional approaches to spell checking suggest spelling corrections using proximity-based matches to a known vocabulary. For highly inflectional Indian languages, any off-the-shelf vocabulary is significantly incomplete, since a large fraction of words in Indic documents are generated using word conjoining rules. Therefore, a tremendous manual effort is needed in spell-correcting words in Indic OCR documents. Moreover, in a spell checking system, a vocabulary may suggest multiple alternatives to the incorrect word. The ranking of these corrective suggestions is improved using language models. Owing to corpus resource scarcity, however, Indian languages lack reliable language models. Thus, learning the character (or n-gram) confusions or error patterns of the OCR system can be helpful in correcting the Out of Vocabulary (OOV) words in OCR documents. We adopt a Long Short-Term Memory (LSTM) based character level language model with a fixed delay for discriminative language modeling in the context of OCR errors for jointly addressing the problems of error detection and correction in Indic OCR. For words that need not be corrected in the OCR output, our model simply abstains from suggesting any changes. We present extensive results to validate the performance of our model on four Indian languages with different inflectional complexities. We achieve F-Scores above 92.4% and decreases in Word Error Rates (WER) of at least 26.7% across the four languages.
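The conventional baseline that the abstract contrasts against, proximity-based matching to a known vocabulary, can be illustrated with a minimal sketch. This is not the authors' LSTM model. The function names `edit_distance` and `suggest`, the `max_distance` threshold, and the toy vocabulary are all assumptions made for illustration, and Levenshtein distance is only one possible proximity measure.

```python
# Illustrative sketch of vocabulary-based spell suggestion (the conventional
# approach described in the abstract), NOT the paper's LSTM method.
# All names and the toy vocabulary below are hypothetical.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word: str, vocabulary: set, max_distance: int = 1) -> list:
    """Return vocabulary words within max_distance of word.

    Results are ordered by distance, then alphabetically. A word already
    in the vocabulary yields no suggestions.
    """
    if word in vocabulary:
        return []
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_distance]


if __name__ == "__main__":
    vocab = {"cat", "cot", "dog"}
    print(suggest("cst", vocab))    # two equally close candidates
    print(suggest("xyzzy", vocab))  # nothing close enough in the vocabulary
```

In this toy setting the misspelling `cst` yields two equally close candidates, which is the ranking problem the abstract mentions, and a string far from every vocabulary entry yields none, which mirrors the situation of out-of-vocabulary words when the vocabulary is incomplete.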


Bibliographic details

Source
IAPR International Conference on Document Analysis and Recognition | 2017 | 732p | 6 pages
Authors
Rohit Saluja; Devaraj Adiga; Parag Chaudhuri; Ganesh Ramakrishnan; Mark Carman
Original format: PDF
Chinese Library Classification: TP391.41-53
Keywords
Optical character recognition software; Delays; Vocabulary; Hidden Markov models; Context modeling; Histograms; Reliability


