In this paper, we present an approach to building a retrieval system over handwritten document images that (i) is recognition-free, (ii) allows text querying, (iii) can retrieve at the sub-word level, and (iv) can search for out-of-vocabulary words. Unlike previous approaches that operate at either the character or the word level, we use character n-gram images (CNG-img) as the retrieval primitive. A CNG-img is a sequence of character segments that is represented and matched in image space. Each word image is then treated as a bag of CNG-img that can be indexed and matched in feature space. This allows for recognition-free search (query-by-example), which can retrieve morphologically similar words that have matching sub-words. Further, to enable query-by-keyword, we build an automated scheme to generate labeled exemplars for characters and character n-grams from unconstrained handwritten documents. We pose this problem as one of weakly supervised learning, where the character/n-gram labeling is obtained automatically from the word labels. The resulting retrieval system can answer queries from an unlimited vocabulary. The approach is demonstrated on the George Washington collection; the results show a major improvement in retrieval performance compared with word-recognition and word-spotting methods.
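The bag-of-n-grams idea can be illustrated with a minimal sketch. This is not the system described above: the paper matches CNG-img in image feature space, whereas the sketch below uses plain text character n-grams as a stand-in, and the function names (`char_ngrams`, `build_index`, `search`), the choice of n-gram lengths, and the overlap score are all assumptions made for illustration. It only shows why indexing sub-word units lets an out-of-vocabulary query still retrieve morphologically related words.

```python
from collections import Counter, defaultdict


def char_ngrams(word, n_values=(2, 3)):
    """Return the bag (multiset) of character n-grams of a word.

    Text n-grams stand in for the paper's CNG-img primitives (assumption).
    """
    word = word.lower()
    bag = Counter()
    for n in n_values:
        for i in range(len(word) - n + 1):
            bag[word[i:i + n]] += 1
    return bag


def build_index(words, n_values=(2, 3)):
    """Build an inverted index mapping each n-gram to the ids of words containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(words):
        for gram in char_ngrams(word, n_values):
            index[gram].add(word_id)
    return index


def search(query, words, index, n_values=(2, 3)):
    """Rank indexed words by the fraction of distinct query n-grams they share.

    The query does not need to be in the indexed vocabulary.
    """
    query_bag = char_ngrams(query, n_values)
    scores = Counter()
    for gram in query_bag:
        for word_id in index.get(gram, ()):
            scores[word_id] += 1
    # Sort by descending score, breaking ties by word id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(words[i], s / len(query_bag)) for i, s in ranked]


if __name__ == "__main__":
    vocabulary = ["letter", "letters", "orders", "Virginia"]
    idx = build_index(vocabulary)
    # "lettering" is not in the vocabulary, yet shares sub-words with "letter".
    for word, score in search("lettering", vocabulary, idx):
        print(f"{word}: {score:.2f}")
```

In an image-based setting, the exact string lookup in `index` would be replaced by nearest-neighbour matching of n-gram image features; the ranking logic would otherwise keep the same shape.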