This project seeks to combine state-of-the-art information visualization techniques with the Cannon Quality Factors for text images in order to characterize and discriminate among text documents and their digital images. The goal is an effective tool for characterizing and managing an OCR Test Corpus of over 1,200 documents. The basic concept is that, once the corpus has been characterized using the Cannon Quality Factors, it should be possible to visually identify regions of expected OCR accuracy and degree of OCR difficulty within it. We have been working with an information visualization tool (dubbed "Parentage") to identify the appropriate metric data for these purposes. Two important potential applications of this work are (1) identifying new research directions for OCR development, and (2) identifying the most appropriate commercial OCR engine or system to use with a given set of documents.