

Extraction of text-related features for condensing image documents


Abstract

A system has been built that selects excerpts from a scanned document for presentation as a summary, without using character recognition. The method relies on the idea that the most significant sentences in a document contain words that are both specific to the document and have a relatively high frequency of occurrence within it. Accordingly, and entirely within the image domain, each page image is deskewed and the text regions are found and extracted as a set of textblocks. Blocks with font size near the median for the document are selected and then placed in reading order. The textlines and words are segmented, and the words are placed into equivalence classes of similar shape. The sentences are identified by finding baselines for each line of text and analyzing the size and location of the connected components relative to the baseline. Scores can then be given to each word, depending on its shape and frequency of occurrence, and to each sentence, depending on the scores for the words in the sentence. Other salient features, such as textblocks that have a large font or are likely to contain an abstract, can also be used to select image parts that are likely to be thematically relevant. The method has been applied to a variety of documents, including articles scanned from magazines and technical journals.
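The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that word segmentation and shape classification have already been done, so each word image arrives as a hypothetical `(class_id, width)` pair, and it uses width as a crude stand-in for "shape". The `min_width` cutoff, the rule that a class must occur more than once, and the length normalization are all illustrative assumptions, as are the function names.

```python
from collections import Counter

def score_sentences(sentences, min_width=3):
    """Score each sentence from the scores of its word-shape classes.

    sentences: list of sentences, each a list of (class_id, width) pairs,
    where class_id identifies a word-shape equivalence class.
    (Hypothetical input format, assumed for this sketch.)
    """
    # Frequency of each equivalence class over the whole document.
    counts = Counter(cid for sent in sentences for cid, _ in sent)

    def word_score(cid, width):
        # Assumption: narrow word images are likely short function words,
        # and a class seen only once carries no frequency evidence.
        if width < min_width or counts[cid] < 2:
            return 0
        return counts[cid]

    scores = []
    for sent in sentences:
        total = sum(word_score(cid, width) for cid, width in sent)
        # Normalize by length so long sentences are not favored automatically.
        scores.append(total / len(sent) if sent else 0.0)
    return scores

def select_excerpts(sentences, k=2):
    """Return the indices of the k best-scoring sentences, in reading order."""
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy document: three sentences of (class_id, width) pairs.
doc = [
    [(1, 5), (2, 2), (3, 6)],
    [(1, 5), (4, 2), (1, 5)],
    [(5, 4), (2, 2), (6, 7)],
]
print(score_sentences(doc))   # per-sentence scores
print(select_excerpts(doc))   # indices of the selected sentences
```

In this toy input, class 1 is the only wide word shape that repeats, so the sentences containing it rank highest. A real system would work on image-derived features throughout and would need a more careful notion of shape than width alone.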


Bibliographic details

Source
Document Recognition III | 1996 | pp. 72-88 | 17 pages
Authors
Dan S. Bloomberg, Xerox Palo Alto Research Ctr., Palo Alto, CA, USA; Francine R. Chen, Xerox Palo Alto Research Ctr., Palo Alto, CA, USA
Original format: PDF
Chinese Library Classification: Automation technology, computer technology

