
Model-based information extraction method tolerant of OCR errors for document images


Abstract

This paper proposes a new method for extracting information from document images, intended as the basis for a document reader that extracts required keywords and their logical relationships from various printed documents. Text obtained from such documents by OCR may contain not only unknown words and compound words but also incorrect words caused by OCR errors. To cope with OCR errors, the proposed method adopts robust keyword matching, which searches for a string pattern in two-dimensional OCR results consisting of a set of possible character candidates. To deal with these difficulties, the keyword matching uses a keyword dictionary that includes incorrect words with typical OCR errors as well as word segments. After keyword matching, global document matching is carried out between the keyword matching results for an input document and document models, which consist of keyword models and their logical relationships. This global matching determines the most suitable model for the input document and solves word segmentation problems accurately even if the document contains unknown words, compound words, or incorrect words. Experimental results on 100 documents show that the method is robust and effective for various document structures.
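The abstract gives no algorithmic detail, so the following is only a minimal sketch of what matching a keyword against character candidates could look like, not the authors' method. It assumes the OCR result is a list of positions, each holding a ranked list of candidate characters; the names `match_at`, `find_keyword`, and `max_rank`, and the rank-sum cost, are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed representation: one ranked candidate list per character position.
Lattice = List[List[str]]


def match_at(lattice: Lattice, keyword: str, start: int,
             max_rank: int = 3) -> Optional[int]:
    """Return the summed candidate ranks if `keyword` matches at `start`.

    A position matches when the keyword character appears among the top
    `max_rank` candidates. Returns None if any position fails to match.
    """
    if start + len(keyword) > len(lattice):
        return None
    cost = 0
    for offset, ch in enumerate(keyword):
        candidates = lattice[start + offset][:max_rank]
        if ch not in candidates:
            return None
        cost += candidates.index(ch)  # rank 0 is the OCR engine's top choice
    return cost


def find_keyword(lattice: Lattice, keyword: str,
                 max_rank: int = 3) -> List[Tuple[int, int]]:
    """Return (start, cost) for every position where `keyword` matches."""
    if not keyword:
        return []
    hits = []
    for start in range(len(lattice) - len(keyword) + 1):
        cost = match_at(lattice, keyword, start, max_rank)
        if cost is not None:
            hits.append((start, cost))
    return hits


if __name__ == "__main__":
    # Top-1 OCR output reads "T0TAL"; the correct "O" is the second candidate.
    lattice = [["T", "I"], ["0", "O", "D"], ["T", "7"], ["A", "4"], ["L", "1"]]
    print(find_keyword(lattice, "TOTAL"))              # [(0, 1)]
    print(find_keyword(lattice, "TOTAL", max_rank=1))  # []
```

Looking past the top candidate lets a keyword be found even when the first-choice reading is wrong. A lower cost means the match relied less on lower-ranked candidates, which a later document-level matching stage could use to rank competing hits.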
