Journal: ACM Transactions on Asian Language Information Processing

A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script


Abstract

This article introduces CALAM (Cursive And Language Adaptive Methodologies), a large corpus of handwritten text document images for Urdu script. The database contains unconstrained handwritten sentences together with structural annotations of the offline handwritten text images, represented in XML. Urdu is the fourth most frequently used language in the world, but because of its complex cursive script and scarce resources, it remains an active research area in document image analysis. A unified approach is applied in developing the corpus: printed text, handwritten text, and demographic information about the writer are collected on a single form. CALAM contains 1,200 handwritten text images, 3,043 lines, 46,664 words, and 101,181 ligatures. To capture maximum variance among words and handwriting styles, data collection is distributed across six categories and 14 subcategories. The forms were filled out by 725 writers from different geographical regions, ages, genders, and educational backgrounds. A structure has been designed to annotate handwritten Urdu script images at the line, word, and ligature levels using an XML standard, providing ground truth for each image at each level of annotation. The corpus is intended for linguistic research and as a benchmark and testbed for evaluating handwritten text recognition techniques for Urdu script, as well as for signature verification, writer identification, digital forensics, classification of printed versus handwritten text, and categorization of texts by use. Experimental results for several recently developed handwritten text line segmentation techniques, evaluated on the proposed dataset, are also presented to demonstrate its viability and usability.
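
The abstract describes XML ground truth at the line, word, and ligature levels but does not give the schema. As a rough illustration of how such a hierarchy could be read, the sketch below parses a small made-up annotation and counts the elements at each level. The element names (`image`, `line`, `word`, `ligature`), the `id` attributes, and the nesting are assumptions made for this example only; they are not taken from the published CALAM format.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for a single form image. Element and attribute
# names are illustrative assumptions, not the actual CALAM schema.
SAMPLE_XML = """
<image id="form_0001">
  <line id="l1">
    <word id="w1">
      <ligature id="g1"/>
      <ligature id="g2"/>
    </word>
    <word id="w2">
      <ligature id="g3"/>
    </word>
  </line>
  <line id="l2">
    <word id="w3">
      <ligature id="g4"/>
      <ligature id="g5"/>
      <ligature id="g6"/>
    </word>
  </line>
</image>
"""


def count_levels(xml_text):
    """Count annotated lines, words, and ligatures in one image record."""
    root = ET.fromstring(xml_text)
    return {
        # Paths are relative to the root <image> element.
        "lines": len(root.findall("line")),
        "words": len(root.findall("line/word")),
        "ligatures": len(root.findall("line/word/ligature")),
    }


counts = count_levels(SAMPLE_XML)
print(counts)
```

Summing such per-image counts over a whole corpus is one way totals like the line, word, and ligature figures quoted above could be checked, provided the real files follow a comparable nesting.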
