Document analysis: Table structure understanding and zone content classification.


Abstract

Over the last three decades, document image analysis researchers have successfully developed many methods for character recognition and for page segmentation of text-based documents. Most of these methods were not designed to handle documents containing complex objects such as tables. We develop a table structure understanding system that detects and decomposes table structures in document images. Our algorithm uses a background analysis technique to locate table candidates and then validates them using various measurements. An iterative optimization method is used to optimize the context probability. The algorithm is probability based: the probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. The probabilities estimated off-line during training then drive all decisions in the on-line table structure understanding modules. We propose an experimental protocol that can simulate any given table ground truth with additional controlled variety. We present a performance evaluation protocol for table structure understanding. Our algorithm reaches correct detection rates of 97.05% and 97.28% at the cell and table levels, respectively.

We propose a new machine learning based approach for genuine table detection from generic web documents. We design a novel web document table ground truthing protocol and use it to build a large table ground truth database. Experiments on this database demonstrate a significant performance improvement over a rule-based system.

Given segmented zone entities and a document image, zone content classification determines the zone types. Our zone content classification algorithms are evaluated on the University of Washington English Document Image Database-III. Using 25 features, we reach an accuracy rate of 98.45%.

We present a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words. Experiments on the University of Washington English Document Image Database-III show that our algorithm performs significantly better than two other competing algorithms.
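To make the input and output of the word extraction task concrete, here is a minimal Python sketch. The abstract does not say how the partition is computed, so the sketch uses a fixed horizontal gap threshold as a stand-in; it is not the thesis's algorithm. The names `Box`, `group_glyphs_into_words`, and `gap_threshold` are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple

# Assumed box layout (not from the thesis): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def group_glyphs_into_words(line_glyphs: List[Box],
                            gap_threshold: float) -> List[List[Box]]:
    """Partition the glyph boxes of ONE text line into words.

    Stand-in rule: a new word starts wherever the horizontal gap between
    a glyph and everything to its left exceeds gap_threshold.
    """
    if not line_glyphs:
        return []

    # Process glyphs left to right regardless of input order.
    ordered = sorted(line_glyphs, key=lambda box: box[0])

    words: List[List[Box]] = [[ordered[0]]]
    right_edge = ordered[0][2]  # rightmost x seen so far

    for box in ordered[1:]:
        if box[0] - right_edge > gap_threshold:
            words.append([box])      # gap is wide: start a new word
        else:
            words[-1].append(box)    # gap is narrow: same word
        right_edge = max(right_edge, box[2])

    return words
```

Calling `group_glyphs_into_words` once per text line and concatenating the results gives a partition of all glyphs on a page. A fixed threshold is the simplest possible rule and will fail on variable spacing, which is the kind of case a trained method is meant to handle.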

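Bibliographic record

  • Author

    Wang, Yalin.

  • Author affiliation

    University of Washington.

  • Degree-granting institution: University of Washington.
  • Subject: Engineering, Electronics and Electrical.
  • Degree: Ph.D.
  • Year: 2002
  • Pages: 161 p.
  • Total pages: 161
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Radio electronics, telecommunications technology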