Abstract: A system has been built that selects excerpts from a scanned document for presentation as a summary, without using character recognition. The method relies on the idea that the most significant sentences in a document contain words that are both specific to the document and have a relatively high frequency of occurrence within it. Accordingly, and entirely within the image domain, each page image is deskewed and the text regions are found and extracted as a set of textblocks. Blocks with font size near the median for the document are selected and then placed in reading order. The textlines and words are segmented, and the words are placed into equivalence classes of similar shape. The sentences are identified by finding baselines for each line of text and analyzing the size and location of the connected components relative to the baseline. Scores can then be given to each word, depending on its shape and frequency of occurrence, and to each sentence, depending on the scores for the words in the sentence. Other salient features, such as textblocks that have a large font or are likely to contain an abstract, can also be used to select image parts that are likely to be thematically relevant. The method has been applied to a variety of documents, including articles scanned from magazines and technical journals.
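The scoring step described above can be illustrated with a minimal sketch. It assumes the image-domain stages have already produced, for each sentence, a list of (shape-class id, word width in pixels) pairs. The function names, the `min_width` cutoff used as a stand-in for "shape", and the use of a mean for the sentence score are all hypothetical choices for illustration; the abstract does not specify the actual scoring formulas.

```python
from collections import Counter


def score_sentences(sentences, min_width=40):
    """Score sentences from word shape classes, with no character recognition.

    sentences: list of sentences, each a list of (class_id, width_px) pairs,
    where class_id identifies an equivalence class of similarly shaped words.
    """
    # Frequency of each shape class over the whole document.
    counts = Counter(cid for sentence in sentences for cid, _ in sentence)

    def word_score(cid, width):
        # Assumption: narrow word images are mostly short function words,
        # so they contribute nothing regardless of how often they occur.
        if width < min_width:
            return 0.0
        # Assumption: a word's score is the raw frequency of its class.
        return float(counts[cid])

    # Assumption: a sentence's score is the mean of its word scores.
    return [
        sum(word_score(c, w) for c, w in s) / len(s) if s else 0.0
        for s in sentences
    ]


def select_excerpts(sentences, k=2, min_width=40):
    """Return indices of the k highest-scoring sentences, in reading order."""
    scores = score_sentences(sentences, min_width=min_width)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    doc = [
        [(1, 20), (2, 80), (3, 60)],
        [(1, 20), (4, 50)],
        [(2, 80), (2, 80), (1, 20)],
    ]
    print(select_excerpts(doc, k=2))
```

Returning the selected indices in reading order keeps the excerpts in the sequence in which they appear on the page, which matches the idea of presenting image excerpts as a summary.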