Journal: Journal of computer sciences

DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type


Abstract

This paper presents a novel Arabic dataset that accounts for the characteristics of the Arabic language and fills gaps not covered by existing datasets. Conventional datasets treat Arabic much like Latin-script languages: they either delete diacritics and supplementary marks, treating them as defects, or keep them without considering their actual meaning. Yet more than half of all Arabic characters carry diacritics above or below the character. To bridge these gaps, this work presents the Detailed Arabic Dataset (DAD). The additional marks included in the dataset are the single dot, two dots "-", three dots "^", Hamza, and two supplementary marks: the bar for Tah or Zah and the complement bar for Kaf. A dedicated application, OFMArabicDatasetBuilder, was built to generate the dataset for online Arabic recognition and writer identification. In total, the ground truth contains 93064 entries based on sub-words and letter parts, rather than on words or lines as in other datasets. The dataset gives researchers a strong tool for online Arabic text recognition, especially in the segmentation phase, and for writer identification. The paper also presents benchmark results obtained by applying k-nearest neighbours machine learning to DAD.
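
For readers unfamiliar with the benchmarking method named in the abstract, the following is a minimal sketch of k-nearest neighbours classification. It is not the authors' implementation: the function name `knn_predict`, the two-value feature vectors, and the labels are hypothetical placeholders, and the toy samples are not taken from DAD.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of (feature_vector, label) pairs.
    query: feature vector with the same length as the training vectors.
    """
    # Sort training samples by Euclidean distance to the query and keep k.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k closest samples.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy features (for example, normalised stroke width and height).
# These values are invented for illustration and do not come from DAD.
train = [
    ((0.10, 0.10), "single_dot"),
    ((0.12, 0.09), "single_dot"),
    ((0.11, 0.12), "single_dot"),
    ((0.40, 0.35), "hamza"),
    ((0.42, 0.33), "hamza"),
    ((0.38, 0.36), "hamza"),
]

label = knn_predict(train, (0.13, 0.11), k=3)
```

The same function could in principle be applied to writer identification by using writer IDs as labels; the feature extraction actually used with DAD is not described in the abstract.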
