Conference: International Conference on Intelligent Information Technology Application; IITA2010

Persian/Arabic document Segmentation Based on Pyramidal Image Structure

Abstract

Automatically converting paper documents into electronic form requires document segmentation as a first stage. However, variations in character font size, differing text line spacing, and non-uniform document layouts have for many years made it difficult to design a general-purpose document layout analysis algorithm. As a result, most previously reported methods inevitably depend on such parameters. The problem is especially acute for Persian/Arabic documents: because the Persian/Arabic scripts differ considerably from English script, most methods proposed for English do not give good results for Persian. In this paper, we present a novel parameter-free method for segmenting Persian/Arabic document images that also works well for English scripts. The method segments the document image into maximal homogeneous regions and classifies them as text or non-text based on a pyramidal image structure. In other words, the proposed method can segment documents without taking character font size, text line spacing, or document layout structure into account. The algorithm was tested on 150 Arabic/Persian and English documents, and segmentation succeeded for 96 percent of them.
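
The abstract does not say how the pyramidal image structure is built, so the following is only a generic illustration of an image pyramid, not the authors' algorithm. It assumes a binarized page (1 = ink, 0 = background) and a simple 2x2 OR-reduction; the names `reduce_once` and `build_pyramid` are hypothetical. At coarser levels nearby ink pixels merge into blobs, which hints at why a pyramid could group regions without fixed font-size or line-spacing thresholds.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink (foreground), 0 = background


def reduce_once(img: Image) -> Image:
    """Halve each dimension (rounding up).

    An output pixel is 1 if any pixel in its 2x2 source block is 1.
    This OR-reduction is an assumption made for illustration only.
    """
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            # Collect the (up to) four pixels of the block, clipped at the border.
            block = [
                img[rr][cc]
                for rr in range(r, min(r + 2, rows))
                for cc in range(c, min(c + 2, cols))
            ]
            out_row.append(1 if any(block) else 0)
        out.append(out_row)
    return out


def build_pyramid(img: Image) -> List[Image]:
    """Return [level 0 (full resolution), level 1, ...] down to a single pixel."""
    levels = [img]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(reduce_once(levels[-1]))
    return levels


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1],
    ]
    for level, image in enumerate(build_pyramid(page)):
        print(level, image)
```

A real segmentation method would still need a rule for deciding which regions are homogeneous and which are text; the abstract does not give that rule, so it is left out here.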
