Model-based information extraction method tolerant of OCR errors for document images

Yasuto Ishitani; Toshihiro Nakamura

Abstract

A new method for information extraction from document images is proposed in this paper as the basis for a document reader which can extract required keywords and their logical relationship from various printed documents. The proposed method consists of robust keyword matching, global document matching, and postprocessing for keyword matching errors. First, robust keyword matching between a set of text lines extracted from an input image and a set of keywords defined in the keyword dictionary is carried out. This keyword matching uses a keyword dictionary that includes incorrect words with typical OCR errors and segments of words to deal with OCR errors. Next, document matching is invoked between keyword matching results in the input document and word models defined in each document model. Each document model consists of a set of word models with their logical relationship described in terms of a tree structure. This model matching extracts required keywords and their logical relationship from the input document and determines the most suitable model for the input document. Finally, postprocessing for recovering matching errors and modifying matching results using heuristic rules defined in the model is applied to keyword matching results. This comprehensive approach solves word segmentation problems accurately even if a document has unknown words, compound words, or incorrect words due to OCR errors. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.
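
The robust keyword matching step can be illustrated with a minimal sketch. The abstract says only that the keyword dictionary includes incorrect words with typical OCR errors and segments of words; everything else here is an assumption made for illustration: the dictionary entries, the `Match` record, the `match_keywords` function, and the longest-form-first rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    keyword: str  # canonical keyword the surface form maps to
    surface: str  # form actually found in the OCR text
    line: int     # index of the text line
    start: int    # character offset within the line


# Hypothetical dictionary: each surface form (correct spelling, a typical
# OCR confusion, or a word segment) maps to its canonical keyword.
KEYWORD_DICT = {
    "Invoice": "Invoice",
    "lnvoice": "Invoice",  # 'I' misread as 'l'
    "Inv0ice": "Invoice",  # 'o' misread as '0'
    "Total": "Total",
    "Tota1": "Total",      # 'l' misread as '1'
    "Tot": "Total",        # segment of the word
}


def match_keywords(lines, dictionary=KEYWORD_DICT):
    """Find dictionary surface forms in OCR text lines (assumed matching rule)."""
    matches = []
    # Try longer surface forms first so a full word wins over its segment.
    forms = sorted(dictionary, key=len, reverse=True)
    for i, text in enumerate(lines):
        taken = [False] * len(text)  # characters already claimed by a match
        for form in forms:
            pos = text.find(form)
            while pos != -1:
                end = pos + len(form)
                if not any(taken[pos:end]):
                    matches.append(Match(dictionary[form], form, i, pos))
                    taken[pos:end] = [True] * len(form)
                pos = text.find(form, pos + 1)
    return sorted(matches, key=lambda m: (m.line, m.start))


if __name__ == "__main__":
    for m in match_keywords(["lnvoice No. 12", "Tota1 amount 300"]):
        print(m)
```

In this sketch a misrecognized form such as `lnvoice` is still reported under its canonical keyword, which is the kind of result the later document matching step would consume. How the paper actually scores or ranks competing matches is not stated in the abstract.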


Bibliographic details

Source
電子情報通信学会技術研究報告. パターン認識·メディア理解. Pattern Recognition and Media Understanding | 2001, No. 712 | 8 pages
Authors
Yasuto Ishitani; Toshihiro Nakamura
Author affiliations
ol.net; ol.net
Original format
PDF
Text language
jpn (Japanese)
Chinese Library Classification
Image communication; multimedia communication
Keywords
Information extraction; Document image analysis; Model-matching; Association graph; Maximal clique
