Abstract: This paper presents an experimental evaluation of several text-based methods for detecting duplication in document image databases using uncorrected OCR output. This task is challenging because of both the wide range of degradations printed documents can suffer, and conflicting interpretations of what it means to be a 'duplicate.' We report results for five sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail. !15
展开▼