International Journal on Document Analysis and Recognition

CMATERdb1: a database of unconstrained handwritten Bangla and Bangla-English mixed script document image



Abstract

In this paper, we describe the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and of Bangla text mixed with English words. This is the first handwritten database in the area described above that is available as an open source document. Because India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 are written purely in Bangla script and the remaining 50 are written in Bangla text mixed with English words. The off-line handwritten pages were collected from different data sources. After collection, all documents were preprocessed and distributed into two groups: CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. Finally, we also provide ground truth images for line segmentation. To generate the ground truth images, we first labeled each line in a document page automatically by applying one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369-374] and then corrected any errors using our tool GT Gen 1.1. Using our algorithm, line extraction accuracies of 90.6% and 92.38% are achieved on the two databases, respectively.
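
The abstract reports line extraction accuracies but does not say how they are computed. As a rough illustration only, the sketch below scores a predicted line labeling against a ground truth labeling of the same page. The function name `line_extraction_accuracy`, the pixel-overlap (intersection over union) matching rule, and the 0.9 threshold are all assumptions made for this example; they are not taken from the paper or from GT Gen 1.1.

```python
from collections import Counter


def line_extraction_accuracy(gt, pred, iou_threshold=0.9):
    """Percentage of ground-truth lines matched by some predicted line.

    gt and pred are equally sized 2D lists of integer line labels,
    where 0 means background. The IoU matching rule and the default
    threshold are illustrative assumptions, not the paper's definition.
    """
    gt_sizes = Counter()    # pixel count per ground-truth line
    pred_sizes = Counter()  # pixel count per predicted line
    overlaps = Counter()    # shared pixel count per (gt, pred) label pair

    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g:
                gt_sizes[g] += 1
            if p:
                pred_sizes[p] += 1
            if g and p:
                overlaps[(g, p)] += 1

    if not gt_sizes:
        return 0.0

    # A ground-truth line counts as extracted if any predicted line
    # overlaps it with IoU at or above the threshold.
    matched = set()
    for (g, p), inter in overlaps.items():
        union = gt_sizes[g] + pred_sizes[p] - inter
        if inter / union >= iou_threshold:
            matched.add(g)

    return 100.0 * len(matched) / len(gt_sizes)


# Toy page with two text lines; the second predicted line is cut short.
gt_page = [[1, 1, 1, 1], [0, 0, 0, 0], [2, 2, 2, 2]]
pred_page = [[1, 1, 1, 1], [0, 0, 0, 0], [2, 2, 0, 0]]
print(line_extraction_accuracy(gt_page, pred_page))
```

On this toy page only the first line clears the 0.9 threshold, so half of the ground-truth lines count as extracted. A stricter or looser threshold changes the score, which is one reason the exact matching rule matters when comparing reported accuracies.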
