Conference: Language Engineering Conference

Page layout analyser for multilingual Indian documents

Abstract

An advanced Optical Character Recognition (OCR) system includes a page layout analyser module. The analyser separates textual zones from non-textual zones, identifies text blocks in multicolumn documents, and groups them into regions that are homogeneous in geometric shape and spatial distribution. All existing OCR modules developed for various Indian scripts can handle only text-only, single-column documents. This paper introduces a page layout analyser that uses common features present in most Indian scripts. A simple compatibility criterion that allows various degrees of homogeneity is defined. The page analyser is robust in the sense that it can distinguish text regions from non-textual entities such as images, rulers, and noise caused by smudges and poor paper quality. Test results are shown for the two most popular Indian scripts, Devnagari (Hindi) and Bangla.
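
The abstract does not spell out the compatibility criterion, so the following is only a minimal illustrative sketch of the general idea: grouping text-line bounding boxes into homogeneous regions under a tolerance that controls how strict "homogeneous" is. The `Box` type, the `compatible` and `group_blocks` functions, the `tol` parameter, and the specific height, overlap, and gap tests are all assumptions made for illustration, not the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a text line or other page component."""
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def compatible(a: Box, b: Box, tol: float = 0.3) -> bool:
    """Hypothetical criterion: similar height, horizontal overlap, small vertical gap.

    A larger `tol` accepts a looser degree of homogeneity.
    """
    tall, short = max(a.h, b.h), min(a.h, b.h)
    if (tall - short) / tall > tol:
        return False  # heights differ too much (e.g. a text line vs an image)
    overlap = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    if overlap <= 0:
        return False  # no horizontal overlap, so likely different columns
    gap = max(a.y, b.y) - min(a.y + a.h, b.y + b.h)
    return gap <= tall * (1 + tol)  # vertically close enough to be neighbours


def group_blocks(boxes: list[Box], tol: float = 0.3) -> list[list[Box]]:
    """Merge pairwise-compatible boxes into regions using union-find."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if compatible(boxes[i], boxes[j], tol):
                parent[find(i)] = find(j)

    groups: dict[int, list[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return list(groups.values())


# Made-up two-column page: three text lines per column plus one tall image.
page = [
    Box(0, 0, 100, 10), Box(0, 15, 100, 10), Box(0, 30, 100, 10),
    Box(150, 0, 100, 10), Box(150, 15, 100, 10), Box(150, 30, 100, 10),
    Box(0, 60, 100, 80),
]
regions = group_blocks(page)
```

With this made-up page, the lines of each column merge transitively into one region, while the tall image box fails the height test and stays in a region of its own.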
