Pattern Recognition: The Journal of the Pattern Recognition Society

Recognition of printed Arabic text based on global features and decision tree learning techniques

Abstract

Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic characters, either on-line or off-line. This is a result of the lack of adequate support in terms of funding and of resources such as Arabic text databases and dictionaries, and of course of the cursive nature of Arabic writing; the problem remains an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantages of machine learning are twofold: it can generalize over the large degree of variation between different fonts and writing styles, and recognition rules can be constructed from examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected components, detect the skew of the document image and correct it. The second step is feature extraction, in which global features of the input Arabic word, such as the number of subwords, the number of peaks within each subword, and the number and position of the complementary characters, are extracted in order to avoid the difficulty of the segmentation stage. Finally, the C4.5 machine learning system is used to generate a decision tree for classifying each word. The system was tested on 1000 Arabic words in different fonts (15 samples per word), and the average correct recognition rate obtained using cross-validation was 92%. (C) 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. [References: 44]
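The final step of the abstract relies on C4.5, which chooses the feature to split on at each tree node by its gain ratio (information gain divided by split information). The following is a minimal sketch of that criterion only, not the authors' system: the feature names `subwords`, `peaks` and `dots`, the toy rows and the labels `word_a` and `word_b` are all hypothetical values made up for illustration, and the real method also handles continuous attributes and pruning, which are omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """C4.5-style gain ratio for splitting on one discrete feature."""
    total = len(rows)
    # Group the class labels by the value the feature takes.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalises features with many distinct values.
    split_info = -sum(
        len(p) / total * math.log2(len(p) / total) for p in partitions.values()
    )
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: made-up global features of four word images.
rows = [
    {"subwords": 1, "peaks": 2, "dots": 1},
    {"subwords": 1, "peaks": 3, "dots": 1},
    {"subwords": 2, "peaks": 2, "dots": 0},
    {"subwords": 2, "peaks": 3, "dots": 0},
]
labels = ["word_a", "word_a", "word_b", "word_b"]

# The feature with the highest gain ratio would become the root test of the tree.
best = max(["subwords", "peaks", "dots"], key=lambda f: gain_ratio(rows, labels, f))
```

In this toy set, `subwords` separates the two words perfectly while `peaks` carries no information about the class, so a gain-ratio-based learner would test `subwords` before `peaks`.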
