Text line script identification for a tri-lingual document


Abstract

India is a multilingual, multi-script country whose states follow a three-language formula, so a document may be printed in English, Hindi and the official language of the state. In Karnataka, for example, a document may contain text lines in English, Hindi and Kannada. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Kannada, Hindi and English text lines in a printed document. The proposed system distinguishes the three scripts using the horizontal projection profile, from which the features of each text line are extracted. The knowledge base of the system was built from 15 document images containing about 450 text lines. For a new text line, the necessary features are extracted from its horizontal projection profile and compared with the stored knowledge base to classify the script. The system was tested on 20 document images containing about 200 text lines of each script and achieved an overall classification rate of 99.83%.
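
As a rough illustration of the horizontal projection profile mentioned in the abstract, the sketch below counts the ink pixels in each pixel row of a binarized text line and derives two simple descriptors from the result. The abstract does not say which features the system extracts, so the function names and the two features (relative position of the strongest row and how much of the line width that row fills) are assumptions made for illustration only, not the authors' method. A row inked across nearly the full width near the top of the line, as a Devanagari headline stroke would produce, is the kind of pattern such features could pick up.

```python
from typing import List, Tuple


def horizontal_projection_profile(line: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each pixel row of a binarized text line."""
    return [sum(row) for row in line]


def peak_features(profile: List[int], width: int) -> Tuple[float, float]:
    """Return (relative peak position, peak fill ratio).

    Both features are hypothetical examples; the source abstract does not
    specify the features used by the described system.
    """
    if not profile or width <= 0:
        raise ValueError("profile must be non-empty and width positive")
    # Index of the row with the most ink (first one if there is a tie).
    peak_row = max(range(len(profile)), key=profile.__getitem__)
    # 0.0 means the top row of the line, 1.0 means the bottom row.
    rel_pos = peak_row / (len(profile) - 1) if len(profile) > 1 else 0.0
    # Fraction of the line width that is inked in the peak row.
    fill = profile[peak_row] / width
    return rel_pos, fill


# Toy 5x8 text line: row 1 is fully inked, rows 2 and 3 hold sparse strokes.
toy_line = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
profile = horizontal_projection_profile(toy_line)
rel_pos, fill = peak_features(profile, width=len(toy_line[0]))
```

In a system like the one described, feature values of this kind would be computed for the training text lines, stored as the knowledge base, and compared against the values from each new text line; the comparison rule itself is not given in the abstract.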
