Word Shape Tokens (WSTs) represent words by their overall shape or contour as they appear in printed text. A character shape code (CSC) mapping function groups similarly shaped letters, such as "g" and "y", under a single code. The rationale is that mapping a scanned image of a word or letter to its WST representation is far easier and more accurate than mapping it to its full ASCII representation. In previous work we showed that user-mediated selection of WSTs for querying document images improved system performance. In the work reported here we use a statistically derived dataset to help determine whether a particular WST from a scanned document image actually matches a query term WST. We do this by comparing the WSTs that precede and follow each WST in a document against previously collected frequency data for a large set of WST occurrences.
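The two ideas above, collapsing letters into shape codes and checking a token against the frequency of its neighbours, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the mapping or statistic used in the work: the shape classes (ascender "A", descender "g", x-height "x"), the function names, the sentence-boundary markers, and the simple relative-frequency score are all hypothetical.

```python
from collections import Counter

# Hypothetical shape classes; the actual CSC alphabet is not given in the text.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def csc(ch):
    """Map one character to an assumed character shape code."""
    if ch.isupper() or ch in ASCENDERS:
        return "A"   # tall characters
    if ch in DESCENDERS:
        return "g"   # characters extending below the baseline ("g", "y", ...)
    if ch in ("i", "j"):
        return ch    # dotted letters keep their own code in this sketch
    if ch.islower():
        return "x"   # remaining x-height letters
    return ch        # digits and punctuation are left unchanged


def word_shape_token(word):
    """Build the WST of a word by concatenating its character shape codes."""
    return "".join(csc(ch) for ch in word)


def build_context_counts(words):
    """Count each WST and each (previous, current, next) WST triple."""
    tokens = [word_shape_token(w) for w in words]
    padded = ["<s>"] + tokens + ["</s>"]  # assumed boundary markers
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    unigrams = Counter(tokens)
    return trigrams, unigrams


def context_support(prev_tok, tok, next_tok, trigrams, unigrams):
    """Fraction of occurrences of tok that had exactly these two neighbours."""
    if unigrams[tok] == 0:
        return 0.0
    return trigrams[(prev_tok, tok, next_tok)] / unigrams[tok]


if __name__ == "__main__":
    corpus = "the dog ran to the dog house".split()
    trigrams, unigrams = build_context_counts(corpus)
    prev_tok, tok, next_tok = (word_shape_token(w) for w in ("the", "dog", "ran"))
    print(tok, context_support(prev_tok, tok, next_tok, trigrams, unigrams))
```

In this sketch a candidate WST from a scanned page would be accepted as a match for a query WST only when its surrounding tokens have been seen around that WST often enough in the reference data; where to set that threshold is left open here.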