Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents


Abstract

Logical entity recognition in heterogeneous collections of document page images remains a challenging problem since the performance of traditional supervised methods degrades dramatically in case of many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both training and recognition phases. We represent the layout style, local features, and logical labels of physical regions of a document compactly by an ordered labeled X-Y tree. Style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in closest-matched layout style cluster of training set. Experimental results show that our algorithm is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
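The pipeline described in the abstract (tree representation, tree distance, style clustering, nearest-cluster matching) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `Node` class, the "V"/"H"/"zone" labels, the unit edit costs, the Selkow-style top-down tree distance, the threshold-based leader clustering, and the function names are all hypothetical. The abstract does not specify which tree distance or clustering algorithm the paper uses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical X-Y tree node: a cut direction or a leaf zone."""
    label: str                       # structural label, e.g. "V", "H", "zone"
    children: Tuple["Node", ...] = ()
    logical: Optional[str] = None    # logical label, known only for training pages


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance between ordered labeled trees.

    Assumed costs: 1 to relabel a node, subtree size to insert or delete
    a whole subtree. Logical labels are ignored; only layout is compared.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),    # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),    # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def cluster_styles(trees, threshold):
    """Leader clustering: a tree joins the first cluster whose first
    member is within `threshold`, otherwise it starts a new cluster."""
    clusters = []
    for t in trees:
        for c in clusters:
            if tree_distance(t, c[0]) <= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters


def recognize(tree, clusters):
    """Return (style index, closest training tree in that style cluster)."""
    style = min(
        range(len(clusters)),
        key=lambda k: min(tree_distance(tree, t) for t in clusters[k]),
    )
    match = min(clusters[style], key=lambda t: tree_distance(tree, t))
    return style, match


# Two labeled training pages: a one-column and a two-column layout.
one_col = Node("V", (Node("zone", logical="title"), Node("zone", logical="body")))
two_col = Node("V", (
    Node("zone", logical="title"),
    Node("H", (Node("zone", logical="body"), Node("zone", logical="body"))),
))
# Unlabeled input page: two-column style with one over-segmented zone.
query = Node("V", (Node("zone"), Node("H", (Node("zone"), Node("zone"), Node("zone")))))

clusters = cluster_styles([one_col, two_col], threshold=1)
style, match = recognize(query, clusters)
print(style, [c.logical for c in match.children])
```

In this sketch the matched training tree supplies the logical labels for the input page; aligning those labels zone by zone would need the edit script of the tree distance, which is omitted here for brevity.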


Bibliographic record

Source
International Conference on Document Analysis and Recognition (ICDAR) | 2007 | 5 pages
Authors
Chen S.; Mao S.; Thoma G.
Original format: PDF
Chinese Library Classification: TP391.41-53
Date added to database: 2022-08-20 21:13:23
