Conference: IAPR International Workshop on Document Analysis Systems

Context-Dependent Confusions Rules for Building Error Model Using Weighted Finite State Transducers for OCR Post-Processing

Abstract

In this paper, we propose a new technique for correcting OCR errors by means of weighted finite state transducers (WFSTs) with context-dependent confusion rules. We translate the OCR confusions that appear in the recognition output into edit operations (insertions, deletions, and substitutions) using the Levenshtein edit distance algorithm. The edit operations are extracted as rules that record the context of the incorrect string, and these rules are used to build an error model with weighted finite state transducers. Conditioning on context helps apply each rule only to the strings where it is appropriate. The new error model avoids the computation incurred when searching the language model, and it enables the language model to correct erroneous words through the context-dependent confusion rules. Our approach is language independent, is designed to handle varying numbers of errors, and places no limit on word length. In a set of experiments conducted on OCRed pages from the UWIII dataset, the proposed error model outperforms both the baseline and the existing state-of-the-art approach based on single-character rules: its error rate on the UWIII test set is 0.68%, compared with 1.14% for the baseline and 1.0% for the single-character rules-based approach.
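
The rule-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `edit_ops` and `context_rules`, the `width` parameter, and the choice of a fixed-width character window on each side as the "context" are all assumptions made for illustration. The sketch aligns an OCR string with its ground truth using Levenshtein edit distance, then counts each edit together with its neighbouring OCR characters. Turning such counts into transducer weights is a further step that the abstract does not specify and that is not shown here.

```python
from collections import Counter

def edit_ops(ocr, truth):
    """Align ocr with truth by Levenshtein distance.

    Returns a list of (ocr_index, wrong, right) edits, where wrong == ""
    means the OCR dropped a character and right == "" means the OCR
    produced a spurious one. (Hypothetical helper, for illustration.)
    """
    n, m = len(ocr), len(truth)
    # dist[i][j] = edit distance between ocr[:i] and truth[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # drop spurious OCR char
                             dist[i][j - 1] + 1,         # restore missing char
                             dist[i - 1][j - 1] + cost)  # match or substitute
    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ocr[i - 1] != truth[j - 1]):
            if ocr[i - 1] != truth[j - 1]:
                ops.append((i - 1, ocr[i - 1], truth[j - 1]))  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((i - 1, ocr[i - 1], ""))                # spurious OCR char
            i -= 1
        else:
            ops.append((i, "", truth[j - 1]))                  # missing char
            j -= 1
    ops.reverse()
    return ops

def context_rules(ocr, truth, width=1):
    """Count edits keyed by (left context, wrong, right, right context).

    The context is taken from the OCR string, `width` characters per side
    (an assumed definition of context, not taken from the paper).
    """
    rules = Counter()
    for pos, wrong, right in edit_ops(ocr, truth):
        left = ocr[max(0, pos - width):pos]
        after = ocr[pos + len(wrong):pos + len(wrong) + width]
        rules[(left, wrong, right, after)] += 1
    return rules
```

For example, `context_rules("tbe", "the")` yields a single rule keyed by `("t", "b", "h", "e")`: substitute `b` with `h` when it appears between `t` and `e`. A context-free rule would instead rewrite every `b`, which is the over-application that the context-dependent rules are meant to avoid.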