Published in: International Conference on Document Analysis and Recognition

Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents

Abstract

Logical entity recognition in heterogeneous collections of document page images remains a challenging problem, since the performance of traditional supervised methods degrades dramatically when there are many distinct layout styles. In this paper we present an unsupervised method in which layout style information is explicitly used in both the training and recognition phases. We represent the layout style, local features, and logical labels of the physical regions of a document compactly by an ordered labeled X-Y tree. The style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in the training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in the closest-matched layout style cluster of the training set. Experimental results show that our algorithm is robust to both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
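The idea of comparing pages by the distance between their ordered labeled trees can be illustrated with a small sketch. The abstract does not say which tree distance or matching rule the authors use, so everything here is an assumption: the `Node` class, the labels (`"V"`/`"H"` for cut directions, zone names for leaves), the unit edit costs, and the nearest-tree rule in `recognize_style` are all hypothetical. The distance is a Selkow-style top-down ordered tree edit distance, swapped in as a stand-in, not the paper's own measure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One node of an ordered labeled tree (hypothetical stand-in for an X-Y tree node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t; used as its insert/delete cost."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance (assumed unit costs, not the paper's measure).

    Roots are always matched (cost 1 if their labels differ). The two child
    sequences are then aligned in order: whole subtrees are inserted or
    deleted at a cost equal to their size, or matched recursively.
    """
    root_cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree of a
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree of b
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return root_cost + d[m][n]


def recognize_style(page: Node, clusters: Dict[str, List[Node]]) -> Tuple[str, Node, int]:
    """Return (style name, closest training tree, distance) for the input page.

    Assumption: the closest-matched cluster is the one holding the single
    nearest training tree; the clusters themselves are taken as given.
    """
    best = None
    for style, trees in clusters.items():
        for t in trees:
            dist = tree_distance(page, t)
            if best is None or dist < best[0]:
                best = (dist, style, t)
    if best is None:
        raise ValueError("clusters must contain at least one tree")
    return best[1], best[2], best[0]


# Toy training set with two hand-made layout styles.
one_col = Node("V", [Node("title"), Node("author"), Node("body")])
two_col = Node("V", [Node("title"), Node("H", [Node("body"), Node("body")])])
clusters = {"one_column": [one_col], "two_column": [two_col]}

# Input page: like the two-column style, but with the body split into three zones.
page = Node("V", [Node("title"), Node("H", [Node("body")] * 3)])
style, match, dist = recognize_style(page, clusters)
print(style, dist)
```

In this toy case the extra body zone costs a single subtree deletion against the two-column tree, so the page is assigned to that style. Transferring logical labels from the matched tree to the input zones would additionally require recovering the alignment from the dynamic-programming table, which the sketch omits.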
