Conference: International Conference on Machine Vision

A Unified Approach for Development of Urdu Corpus for OCR and Demographic Purpose

Abstract

This paper presents a methodology for developing an Urdu handwritten text image corpus and for applying corpus linguistics to OCR and information retrieval from handwritten documents. Compared with other scripts, Urdu script is somewhat complicated for data entry: entering a single character requires a combination of several keystrokes. A mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would simplify the manual data processing currently involved in data collection from input forms such as Passport, Ration Card, Voting Card, AADHAR, Driving Licence, Indian Railway Reservation, and Census forms. This would increase the participation of the Urdu language community in understanding and benefiting from Government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata information for benchmarking.
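
The abstract does not give the XML metadata schema, so the following is only a minimal sketch of what one benchmark record pairing a form image with its digital transcription and demographic fields might look like. Every element and attribute name (`record`, `demographics`, `transcription`, `image`, `form`), the helper `build_record`, and the sample values are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def build_record(image_file, transcription, writer, form_type):
    """Build one hypothetical metadata record for a handwritten form image.

    The element and attribute names are illustrative assumptions,
    not the schema used by the paper's authors.
    """
    record = ET.Element("record", attrib={"image": image_file, "form": form_type})

    # Demographic fields about the writer, one child element per field.
    demographics = ET.SubElement(record, "demographics")
    for key, value in writer.items():
        ET.SubElement(demographics, key).text = str(value)

    # Digital transcription of the handwritten Urdu text (ground truth for OCR).
    text = ET.SubElement(record, "transcription", attrib={"lang": "ur"})
    text.text = transcription
    return record


# Made-up sample values, for illustration only.
record = build_record(
    image_file="form_0001.png",
    transcription="اردو",
    writer={"age": 30, "region": "Delhi"},
    form_type="census",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the transcription and the demographic fields in the same record would let one file serve both as OCR ground truth and as training data for automatic form-field extraction, which is the dual purpose the abstract describes.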