Logical entity recognition in heterogeneous collections of document page images remains a challenging problem, since the performance of traditional supervised methods degrades dramatically when the collection contains many distinct layout styles. In this paper we present an unsupervised method in which layout style information is used explicitly in both the training and recognition phases. We compactly represent the layout style, local features, and logical labels of the physical regions of a document by an ordered labeled X-Y tree. The style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages in the training set, which carry true logical labels, are grouped into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in the closest-matching layout style cluster of the training set. Experimental results show that our algorithm is robust to both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in the tree representations of the same layout style.
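The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tree distance or the clustering algorithm, so the sketch assumes a simple top-down ordered tree edit distance with unit costs and a threshold-based leader clustering as stand-ins. The names `Node`, `tree_distance`, `cluster_styles`, and `recognize`, as well as the node labels and the threshold, are hypothetical. Transferring logical labels to the input zones through a node alignment is omitted; the sketch only returns the matched style cluster and the best-matching labeled training tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Node of an ordered labeled tree (assumed stand-in for an X-Y tree node)."""
    label: str            # e.g. cut direction "H"/"V" or a zone type such as "text"
    children: tuple = ()  # ordered child nodes
    logical: str = ""     # logical label (known only for training pages)


def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a, b):
    """Top-down ordered tree edit distance with unit costs (an assumption).

    Roots are compared by label; child sequences are aligned with a
    string-edit-style dynamic program in which inserting or deleting a
    child costs the size of its whole subtree.
    """
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),       # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),       # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return (0 if a.label == b.label else 1) + d[m][n]


def cluster_styles(trees, threshold):
    """Leader clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for t in trees:
        for c in clusters:
            if tree_distance(t, c[0]) <= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters


def recognize(query, clusters):
    """Return (style index, best-matching training tree) for an input tree."""
    style = min(
        range(len(clusters)),
        key=lambda k: min(tree_distance(query, t) for t in clusters[k]),
    )
    match = min(clusters[style], key=lambda t: tree_distance(query, t))
    return style, match


# Toy training pages (hypothetical layouts and labels).
page_a = Node("V", (Node("text", logical="title"), Node("text", logical="body")))
page_b = Node("V", (Node("text"), Node("text"), Node("text")))
page_c = Node("H", (Node("image"), Node("V", (Node("text"), Node("text")))))

clusters = cluster_styles([page_a, page_b, page_c], threshold=2)
style, match = recognize(Node("V", (Node("text"), Node("text"))), clusters)
```

In this toy run the two single-column pages fall into one style cluster and the image-plus-column page into another, and the unlabeled two-zone input is matched to `page_a`, whose logical labels could then be mapped onto the input zones.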