Model-based information extraction method tolerant of OCR errors for document images


Abstract

A new method for information extraction from document images is proposed in this paper as the basis for a document reader which can extract required keywords and their logical relationship from various printed documents. The proposed method consists of robust keyword matching, global document matching, and postprocessing for keyword matching errors. First, robust keyword matching between a set of text lines extracted from an input image and a set of keywords defined in the keyword dictionary is carried out. This keyword matching uses a keyword dictionary that includes incorrect words with typical OCR errors and segments of words to deal with OCR errors. Next, document matching is invoked between keyword matching results in the input document and word models defined in each document model. Each document model consists of a set of word models with their logical relationship described in terms of a tree structure. This model matching extracts required keywords and their logical relationship from the input document and determines the most suitable model for the input document. Finally, postprocessing for recovering matching errors and modifying matching results using heuristic rules defined in the model is applied to keyword matching results. This comprehensive approach solves word segmentation problems accurately even if a document has unknown words, compound words, or incorrect words due to OCR errors. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.
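
The first two stages described above (keyword matching against a dictionary that also lists OCR-damaged spellings, then choosing the best-fitting document model) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the dictionary entries, the `WordModel` class, the function names, and in particular the scoring rule, which simply measures what fraction of a model's tree nodes were found. The paper's matching also uses the logical relationships in the tree, which this sketch does not attempt.

```python
from dataclasses import dataclass, field

# Hypothetical keyword dictionary: each canonical keyword lists its correct
# spelling plus a few typical OCR confusions (l/I, 1/l, 0/o, t/l).
KEYWORD_DICT = {
    "invoice": ["invoice", "lnvoice", "invo1ce"],
    "total": ["total", "tota1", "t0tal"],
    "date": ["date", "dale"],
    "receipt": ["receipt", "rece1pt"],
}


def match_keywords(text_lines):
    """Return (canonical_keyword, line_index, column) for every dictionary hit.

    Plain substring search is used for brevity, so a keyword embedded in a
    longer word (e.g. "date" inside "update") would also be reported.
    """
    hits = []
    for line_index, line in enumerate(text_lines):
        lowered = line.lower()
        for canonical, variants in KEYWORD_DICT.items():
            for variant in variants:
                column = lowered.find(variant)
                if column != -1:
                    hits.append((canonical, line_index, column))
                    break  # one hit per keyword per line is enough here
    return hits


@dataclass
class WordModel:
    """One node of a document model tree (hypothetical structure)."""
    keyword: str
    children: list = field(default_factory=list)


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)


def count_matched(node, found_keywords):
    own = 1 if node.keyword in found_keywords else 0
    return own + sum(count_matched(child, found_keywords) for child in node.children)


def best_model(models, hits):
    """Pick the model whose tree is best covered by the keyword hits.

    Assumed scoring rule: matched nodes divided by total nodes.
    """
    found = {canonical for canonical, _, _ in hits}
    scores = {
        name: count_matched(root, found) / count_nodes(root)
        for name, root in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy OCR output with typical character confusions.
ocr_lines = ["lNVOICE No. 42", "Dale: 2001-01-01", "T0tal 100"]

# Two toy document models sharing the same child keywords.
models = {
    "invoice_form": WordModel("invoice", [WordModel("date"), WordModel("total")]),
    "receipt_form": WordModel("receipt", [WordModel("date"), WordModel("total")]),
}

hits = match_keywords(ocr_lines)
chosen, scores = best_model(models, hits)
```

In this toy run all three lines contain an OCR-damaged keyword, each is mapped back to its canonical form, and `invoice_form` is selected because every node of its tree is covered. A postprocessing stage, as the abstract describes, would then apply model-specific heuristic rules to repair remaining matching errors; that stage is not sketched here.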
