International Conference on Document Analysis and Recognition

Automatic extraction of correlation-entropy features for text document analysis directly in run-length compressed domain



Abstract

Automatic feature extraction plays a pivotal role in the overall performance of any Document Image Analysis system. Such systems conventionally operate directly on uncompressed images, although most real-time systems, such as fax machines, digital libraries and e-governance applications, accumulate and archive documents in compressed form for the sake of storage and transfer efficiency. This implies that compressed documents must be decompressed before any operation or analysis is carried out, which requires additional computing resources. This limitation of existing systems motivates the exploration of feature extraction techniques that work directly on compressed documents, and eventually the design of a document analysis system that works entirely in the compressed domain. Therefore, this research work proposes to extract novel correlation-entropy features directly from run-length compressed TIFF documents. Further, the work investigates different methods to demonstrate some straightforward applications of the proposed features in compressed document image analysis, such as text and non-text component detection, followed by compressed text line segmentation and characterization, all carried out on the compressed version of the printed text document without a decompression stage. Finally, the reported experimental results validate the developed algorithms and illustrate that the proposed features are quite powerful in distinguishing compressed text from non-text components.
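To make the idea of working on run lengths without decompression concrete, here is a minimal sketch. It is not the paper's correlation-entropy definition, which the abstract does not spell out. It assumes each image row is already available as a list of alternating run lengths that starts with a white run (as in CCITT-style run-length coding), and it uses a generic Shannon entropy over run-length values plus a simple ink-based grouping of rows into candidate text-line bands. The function names `row_run_entropy`, `row_black_pixels` and `text_line_bands` are hypothetical.

```python
import math
from collections import Counter


def row_run_entropy(runs):
    """Shannon entropy (in bits) of the run-length values in one row.

    Generic illustration only; not the paper's feature definition.
    """
    if not runs:
        return 0.0
    counts = Counter(runs)
    total = len(runs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def row_black_pixels(runs):
    """Number of black pixels in a row.

    Assumes runs alternate white, black, white, ... starting with white,
    so black runs sit at the odd indices.
    """
    return sum(runs[1::2])


def text_line_bands(rows, min_black=1):
    """Group consecutive rows that contain ink into (start, end) bands.

    Works purely on run lengths; no pixel row is ever reconstructed.
    """
    bands = []
    start = None
    for i, runs in enumerate(rows):
        has_ink = row_black_pixels(runs) >= min_black
        if has_ink and start is None:
            start = i  # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, i - 1))  # band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(rows) - 1))
    return bands


# Toy page, 10 pixels wide: blank row, two inked rows, blank row, one inked row.
rows = [[10], [2, 3, 5], [1, 1, 1, 1, 6], [10], [4, 2, 4]]
bands = text_line_bands(rows)
entropies = [row_run_entropy(r) for r in rows]
```

On the toy page, `bands` groups rows 1 to 2 and row 4 into two separate candidate lines, and a blank row (a single white run) has zero entropy. A real system would read the runs from the TIFF's compressed strips instead of hand-written lists.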
