International Conference on Machine Vision

A Unified Approach for Development of Urdu Corpus for OCR and Demographic Purpose


Abstract

This paper presents a methodology for developing a corpus of Urdu handwritten text images and for applying corpus linguistics to OCR and information retrieval from handwritten documents. Compared with other scripts, Urdu is somewhat more complicated for data entry: entering a single character requires a combination of several keystrokes. A mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would help simplify the manual data processing currently involved in data collection through input forms such as those for Passport, Ration Card, Voting Card, AADHAR, Driving Licence, Indian Railway Reservation, and Census data. This would increase the participation of the Urdu-language community in understanding and benefiting from Government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata for benchmarking.
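The abstract does not give the XML schema used for the metadata, so the following is only a minimal sketch of what one corpus record might look like. Every element name, attribute, and sample value here (`record`, `writer`, `transcription`, `fields`, `form_0001.png`, and so on) is a hypothetical assumption for illustration, not the authors' format.

```python
import xml.etree.ElementTree as ET


def build_record(image_file, transcription, writer, form_fields):
    """Return an XML element describing one handwritten form image.

    The layout is an illustrative assumption; the paper's actual
    schema is not specified in the abstract.
    """
    record = ET.Element("record", attrib={"image": image_file})

    # Demographic details about the writer (hypothetical field names).
    writer_el = ET.SubElement(record, "writer")
    for key, value in writer.items():
        ET.SubElement(writer_el, key).text = value

    # Digital (Unicode) transcription of the handwritten Urdu text.
    ET.SubElement(record, "transcription", attrib={"lang": "ur"}).text = transcription

    # Form fields of the kind a system might learn to extract automatically.
    fields_el = ET.SubElement(record, "fields")
    for name, value in form_fields.items():
        ET.SubElement(fields_el, "field", attrib={"name": name}).text = value

    return record


# Sample values are made up for the sketch.
record = build_record(
    image_file="form_0001.png",
    transcription="اردو",
    writer={"age_group": "20-30", "gender": "F"},
    form_fields={"name": "مثال"},
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the transcription and the form fields in the same record as the image reference is one way a single corpus entry could serve both OCR benchmarking and demographic data extraction.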
