Journal: Journal of ICT Research and Applications

A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition

Abstract

Document image analysis and recognition are important topics in the field of artificial intelligence. In this context, the availability of a database with good script samples is an important requirement for machine-learning processes. For Latin and Asian languages, many suitable databases exist, but databases with Arabic samples are in short supply. In this work, a new database of printed Arabic text is introduced. It adopts the new concept of collecting sub-words (PAWs) instead of whole words or individual character samples. Together, these PAWs constitute all words in the Arabic language. The collected database consists of 83,056 PAW images extracted from approximately 550,000 different words. Each sample is presented in the database in five font types: Thuluth, Naskh, Andalusi, Typing Machine, and Kufi. In total, the database consists of 415,280 images. Moreover, ground truth information is included with each PAW image to describe its occurrence number, occurrence frequency, positions and the shapes of the characters. This paper also presents a statistical analysis of the frequency of each PAW in the Arabic language.
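To make the PAW idea concrete, here is a minimal Python sketch of how a word list might be broken into PAWs and how per-PAW frequencies could be tallied. It is not the authors' extraction procedure, which the abstract does not describe. It assumes undiacritized input and a simplified set of letters that do not join to the following letter; the names `NON_JOINING`, `split_paws`, and `paw_frequencies` are hypothetical.

```python
from collections import Counter

# Assumption: a simplified set of Arabic letters that do not connect to the
# letter that follows them (alef variants, dal, thal, ra, zay, waw variants,
# hamza, ta marbuta, alef maqsura). The paper's exact rules are not given.
HAMZA = "\u0621"
NON_JOINING = set(
    "\u0627\u0623\u0625\u0622"  # alef and its hamza/madda variants
    "\u062F\u0630\u0631\u0632"  # dal, thal, ra, zay
    "\u0648\u0624"              # waw, waw with hamza
    "\u0621\u0629\u0649"        # hamza, ta marbuta, alef maqsura
)


def split_paws(word):
    """Split one undiacritized Arabic word into its connected pieces (PAWs)."""
    paws, current = [], ""
    for ch in word:
        # Stand-alone hamza does not join to the preceding letter either.
        if ch == HAMZA and current:
            paws.append(current)
            current = ""
        current += ch
        # A non-joining letter closes the current PAW.
        if ch in NON_JOINING:
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


def paw_frequencies(words):
    """Return (occurrence counts, relative frequencies) over all PAWs."""
    counts = Counter(paw for w in words for paw in split_paws(w))
    total = sum(counts.values())
    freqs = {paw: n / total for paw, n in counts.items()}
    return counts, freqs


# Illustrative two-word list (not taken from the database).
words = ["\u0643\u062A\u0627\u0628", "\u0645\u062F\u0631\u0633\u0629"]
counts, freqs = paw_frequencies(words)

# Size check from the abstract: 83,056 PAWs rendered in five fonts.
total_images = 83_056 * 5
```

The final line reproduces the abstract's arithmetic: 83,056 PAWs in five fonts gives 415,280 images. A production splitter would also need to strip diacritics and tatweel and to handle the lam-alef ligature, which this sketch ignores.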