Abstract: Frequently object recognition accuracy is a key component in the performance analysis of pattern matching systems. In the past three years, the results of numerous excellent and rigorous studies of OCR system typeset-character accuracy (henceforth OCR accuracy) have been published, encouraging performance comparisons between a variety of OCR products and technologies. These published figures are important; OCR vendor advertisements in the popular trade magazines lead readers to believe that published OCR accuracy figures effect market share in the lucrative OCR market. Curiously, a detailed review of many of these OCR error occurrence counting results reveals that they are not reproducible as published and they are not strictly comparable due to larger variances in the counts than would be expected by the sampling variance. Naturally, since OCR accuracy is based on a ratio of the number of OCR errors over the size of the text searched for errors, imprecise OCR error accounting leads to similar imprecision in OCR accuracy. Some published papers use informal, non-automatic, or intuitively correct OCR error accounting. Still other published results present OCR error accounting methods based on string matching algorithms such as dynamic programming using Levenshtein (edit) distance but omit critical implementation details (such as the existence of suspect markers in the OCR generated output or the weights used in the dynamic programming minimization procedure). The problem with not specifically revealing the accounting method is that the number of errors found by different methods are significantly different. This paper identifies the basic accounting methods used to measure OCR errors in typeset text and offers an evaluation and comparison of the various accounting methods. !36
展开▼