Model-based information extraction method tolerant of OCR errors for document images

Abstract

This paper proposes a new method for information extraction from document images, intended as the basis for a document reader that can extract required keywords and their logical relationships from various printed documents. Text obtained from OCR results may contain not only unknown words and compound words, but also incorrect words caused by OCR errors. To cope with OCR errors, the proposed method adopts robust keyword matching, which searches for a string pattern in two-dimensional OCR results consisting of a set of possible character candidates. To deal with the above difficulties, this keyword matching uses a keyword dictionary that includes incorrect words with typical OCR errors as well as segments of words. After keyword matching, global document matching is carried out between the keyword matching results for an input document and document models, which consist of keyword models and their logical relationships. This global matching determines the most suitable model for the input document and solves word segmentation problems accurately even if the document has unknown words, compound words, or incorrect words. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.
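
The keyword matching step described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each character position in a text line carries a short string of OCR candidate characters, and it accepts a keyword only when every character appears among the candidates at the corresponding position. The names `Lattice`, `find_keyword`, and `match_dictionary`, the sample candidates, and the dictionary entries are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR output for one text line: one string per character
# position, holding that position's candidate characters (best guess first).
Lattice = List[str]


def find_keyword(lattice: Lattice, keyword: str) -> List[int]:
    """Return start offsets where each keyword character is among the
    candidates at the corresponding lattice position."""
    hits = []
    for start in range(len(lattice) - len(keyword) + 1):
        if all(ch in lattice[start + i] for i, ch in enumerate(keyword)):
            hits.append(start)
    return hits


def match_dictionary(
    lattice: Lattice, dictionary: Dict[str, List[str]]
) -> List[Tuple[str, int, str]]:
    """Match every surface form of every keyword.

    `dictionary` maps a canonical keyword to its surface forms, which may
    include variants with typical OCR errors (an assumption modeled on the
    abstract's description of the keyword dictionary).
    Returns (canonical keyword, start offset, matched form) tuples.
    """
    results = []
    for canonical, forms in dictionary.items():
        for form in forms:
            for start in find_keyword(lattice, form):
                results.append((canonical, start, form))
    return results


# Sample line whose top OCR choices read "TF1:03": 'E' and 'L' survive
# only as second candidates.
lattice = ["T7", "FE", "1L", ":;", "0O", "3"]
dictionary = {"TEL": ["TEL", "TE1", "TFL"]}

matches = match_dictionary(lattice, dictionary)
```

In this sketch the keyword "TEL" is found at offset 0 even though the recognizer's top choices spell "TF1", because the correct characters are present among the lower-ranked candidates. A fuller treatment would also need to handle character segmentation errors, which shift positions and which this position-by-position comparison does not cover.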

Bibliographic record

Source: 2001, pp. 908-915 (8 pages)
Author: Ishitani, Y.
Original format: PDF
Chinese Library Classification: Radio electronics; telecommunication technology

Similar literature

1. Model-based information extraction method tolerant of OCR errors for document images [J]. Yasuto Ishitani, Toshihiro Nakamura. 電子情報通信学会技術研究報告. 言語理解とコミュニケーション. Natural Language Understanding and Models of Communication. 2001, No. 711
2. Model-based information extraction method tolerant of OCR errors for document images [J]. Yasuto Ishitani, Toshihiro Nakamura. 電子情報通信学会技術研究報告. パターン認識·メディア理解. Pattern Recognition and Media Understanding. 2001, No. 712
3. Model-based information extraction method tolerant of OCR errors for document images [J]. Yasuto Ishitani, Toshihiro Nakamura. 電子情報通信学会技術研究報告. パターン認識·メディア理解. Pattern Recognition and Media Understanding. 2001, No. 712
4. A document image retrieval method tolerating recognition and segmentation errors of OCR using shape-feature and multiple candidates [C]. Kameshiro, T., Hirano, . 1999
5. Iterative model-based binarization for document images [D]. Dawoud, Amer. 2003
6. Frequency of data extraction errors and methods to increase data extraction quality: a methodological review [O]. Tim Mathes, Pauline Klaßen, Dawid Pieper. 2017
7. Performing Information Extraction to Improve OCR Error Detection in Semi-structured Historical Documents [O]. Thomas L. Packer. 2012
8. Model Based Restoration of Document Images for OCR [R]. M. Y. Jaisimha, Eve A. Riskin, Richard Ladner. 1996
