Document analysis: Table structure understanding and zone content classification.

Abstract

Over the last three decades, document image analysis researchers have successfully developed many methods for character recognition and page segmentation of text-based documents. Most of these methods were not designed to handle documents containing complex objects, such as tables. We develop a table structure understanding system that can detect and decompose table structures from document images. Our algorithm uses a background analysis technique to locate table candidates and then validates them using various measurements. An iterative optimization method is used to optimize the context probability. Our algorithm is probability based: the probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. The off-line probabilities estimated in training then drive all decisions in the on-line table structure understanding modules. We propose an experimental protocol that can simulate any given table ground truth with additional controlled variety. We present a table structure understanding performance evaluation protocol. Our algorithm reaches correct detection rates of 97.05% and 97.28% at the cell and table levels, respectively.

We propose a new machine learning based approach for genuine table detection from generic web documents. We design a novel web document table ground truthing protocol and use it to build a large table ground truth database. Experiments on this database demonstrate a significant performance improvement over a rule-based system.

Given segmented zone entities and a document image, zone content classification determines the zone types. Our zone content classification algorithms are evaluated on the University of Washington English Document Image Database-III. Using 25 features, we reach an accuracy rate of 98.45%.

We present a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words. Experiments on the University of Washington English Document Image Database-III show that our algorithm performs significantly better than two other competing algorithms.
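The text word extraction task described in the abstract (glyph bounding boxes on a text line in, groups of glyphs forming words out) can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the abstract does not specify. It is a plain gap-threshold heuristic, and the function name `group_glyphs_into_words`, the `(x_min, y_min, x_max, y_max)` box layout, and the `gap_ratio` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def group_glyphs_into_words(glyphs: List[Box], gap_ratio: float = 0.5) -> List[List[Box]]:
    """Partition the glyphs of ONE text line into words.

    Illustrative heuristic only: a new word starts whenever the horizontal
    gap to the previous glyph exceeds gap_ratio times the median glyph height.
    """
    if not glyphs:
        return []

    # Process glyphs left to right.
    ordered = sorted(glyphs, key=lambda b: b[0])

    # Use the median glyph height as a rough scale for the line.
    heights = sorted(b[3] - b[1] for b in ordered)
    median_height = heights[len(heights) // 2]
    threshold = gap_ratio * median_height

    words: List[List[Box]] = [[ordered[0]]]
    right_edge = ordered[0][2]
    for box in ordered[1:]:
        gap = box[0] - right_edge
        if gap > threshold:
            # Large gap: treat it as an inter-word space.
            words.append([box])
            right_edge = box[2]
        else:
            # Small gap or overlap: same word.
            words[-1].append(box)
            right_edge = max(right_edge, box[2])
    return words


# Usage: five glyphs of height 10, with one wide gap after the third.
line = [(0, 0, 8, 10), (9, 0, 17, 10), (18, 0, 26, 10), (36, 0, 44, 10), (45, 0, 53, 10)]
words = group_glyphs_into_words(line)
```

A fixed threshold like this is fragile across fonts and scan resolutions, which is one reason to prefer an approach whose decisions are learned from ground-truthed data.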

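Bibliographic details

Author: Wang, Yalin
Author affiliation: University of Washington
Degree-granting institution: University of Washington
Subject: Engineering, Electronics and Electrical
Degree: Ph.D.
Year: 2002
Pages: 161
Original format: PDF
Language: English
Chinese Library Classification: Radio electronics, telecommunications technology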
