Automatic extraction of correlation-entropy features for text document analysis directly in run-length compressed domain


Abstract

Automatic feature extraction plays a pivotal role in the overall performance of any Document Image Analysis system. Such systems conventionally operate directly on uncompressed images, although most real-time systems, such as fax machines, digital libraries, and e-governance applications, accumulate and archive documents in compressed form for the sake of storage and transfer efficiency. This implies that compressed documents must be decompressed before any operation or analysis is carried out, which requires additional computing resources. This limitation of existing systems motivates the exploration of feature extraction techniques that work directly on compressed documents, with the eventual goal of designing a document analysis system that operates entirely in the compressed domain. This research work therefore proposes extracting novel correlation-entropy features directly from run-length compressed TIFF documents. It also investigates different methods to demonstrate some straightforward applications of the proposed features in compressed document image analysis, such as text and non-text component detection followed by compressed text line segmentation and characterization, all carried out on the compressed version of the printed text document without decompression. Finally, the reported experimental results validate the developed algorithms and illustrate that the proposed features are quite powerful in distinguishing compressed text from non-text components.


Bibliographic record

Source
International Conference on Document Analysis and Recognition, 2015, pp. 1-5 (5 pages)
Authors
Javed Mohammed; Nagabhushan P.; Chaudhuri B.B.
Original format: PDF
Keywords
correlation methods; data compression; document image processing; entropy; feature extraction; image classification; image segmentation; text analysis; automatic correlation-entropy feature extraction; compressed document image analysis; compressed text line characterization; compressed text line segmentation; computing resources; correlation-entropy features; digital libraries; document analysis system; document image analysis system; e-governance applications; feature extraction techniques; nontext component detection; real time systems; run-length compressed TIFF documents; run-length compressed domain; text document analysis; uncompressed images; Image coding; Image segmentation; Compressed document feature extraction; Compressed document image processing; Correlation-entropy analysis; Run-length compressed domain;


Similar documents

1. Visualizing CCITT Group 3 and Group 4 TIFF Documents and Transforming to Run-Length Compressed Format Enabling Direct Processing in Compressed Domain [J]. Mohammed Javed, S.H. Krishnanand, P. Nagabhushan. Procedia Computer Science, 2016, No. 1.
2. A review on document image analysis techniques directly in the compressed domain [J]. Javed Mohammed, Nagabhushan P., Chaudhuri Bidyut B. Artificial Intelligence Review: An International Science and Engineering Journal, 2018, No. 4.
3. Deep Text Mining for Automatic Keyphrase Extraction from Text Documents [J]. Muhammad Abulaish, Jahiruddin, Lipika Dey. Journal of Intelligent Systems, 2011, No. 4.
4. Automatic extraction of correlation-entropy features for text document analysis directly in run-length compressed domain [C]. Javed Mohammed, Nagabhushan P., Chaudhuri B.B. International Conference on Document Analysis and Recognition, 2015.
5. Noun phrases in documents: Preprocessing, automatic extraction, and statistical analysis in different categories of text [D]. Kim, Youngin. 2002.
6. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents [O]. Deepak Agnihotri, Kesari Verma, Priyanka Tripathi.
7. Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents [O]. Javed, Mohammed; Nagabhushan, P.; Chaudhuri, B. B. 2014.
8. Almost Automatic Semantic Feature Extraction from Technical Text [R]. Agarwal, R. 1994.
