We propose a lexicon-based method for correcting words recognized by an OCR engine (a classifier). This post-processing method was originally designed for languages that use diacritical marks, such as Portuguese. Because the classifier can confuse these marks with noise, keeping only the top hypothesis per glyph of the original image can lead to wrong predictions. To cope with this, our method uses a filtering strategy to select the best hypotheses for each glyph, which are used to produce candidate queries. The best query is selected according to its confidence rate and its edit distance to the word. A similarity search over the best query then suggests a correction. Experiments show that the method considerably improves prediction accuracy for Portuguese word correction.
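The pipeline summarized above (filter per-glyph hypotheses, build candidate queries, pick a best query, search the lexicon) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the filtering rule, the scoring function, or the similarity search, so everything concrete here is an assumption. In particular, the names `filter_hypotheses`, `candidate_queries`, and `correct`, the parameters `top_k` and `min_conf`, the use of the product of glyph confidences as a query's confidence, and the choice to rank queries by Levenshtein distance to their nearest lexicon entry (ties broken by confidence) are all hypothetical.

```python
from itertools import product
from math import prod


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def filter_hypotheses(glyph_hyps, top_k=2, min_conf=0.1):
    """Assumed filter: keep at most top_k hypotheses per glyph whose
    confidence is >= min_conf, and always keep the top hypothesis."""
    kept = []
    for hyps in glyph_hyps:
        ranked = sorted(hyps, key=lambda h: h[1], reverse=True)
        strong = [h for h in ranked[:top_k] if h[1] >= min_conf]
        kept.append(strong or ranked[:1])
    return kept


def candidate_queries(filtered):
    """Yield (text, confidence) for every combination of kept hypotheses.
    Confidence is assumed to be the product of the glyph confidences."""
    for combo in product(*filtered):
        text = "".join(ch for ch, _ in combo)
        yield text, prod(conf for _, conf in combo)


def correct(glyph_hyps, lexicon, top_k=2, min_conf=0.1):
    """Return the lexicon word suggested for the best candidate query."""
    best_key, best_word = None, None
    for text, conf in candidate_queries(filter_hypotheses(glyph_hyps, top_k, min_conf)):
        # Brute-force similarity search: nearest lexicon word by edit distance.
        word = min(lexicon, key=lambda w: levenshtein(text, w))
        key = (levenshtein(text, word), -conf)  # small distance, then high confidence
        if best_key is None or key < best_key:
            best_key, best_word = key, word
    return best_word


# Toy input: per-glyph (character, confidence) hypotheses for the word "pão".
hyps = [
    [("p", 0.9)],
    [("a", 0.6), ("ã", 0.35)],   # the tilde is partly mistaken for noise
    [("o", 0.8), ("c", 0.1)],
]
lexicon = ["pai", "pau", "paz", "pão"]
print(correct(hyps, lexicon))  # prints "pão"
```

In this toy case the top hypothesis per glyph spells "pao", which is one edit away from every lexicon entry and therefore ambiguous. Keeping the second hypothesis for the middle glyph yields the query "pão", which matches the lexicon exactly. The brute-force lexicon scan is only for illustration; a real lexicon would need an indexed similarity search.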