A Unified Approach for Development of Urdu Corpus for OCR and Demographic Purpose


Abstract

This paper presents a methodology for developing a corpus of Urdu handwritten text images and for applying corpus linguistics to OCR and to information retrieval from handwritten documents. Compared with other scripts, Urdu is somewhat complicated for data entry: entering a single character can require a combination of several keystrokes. A mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would simplify the manual data processing currently involved in handling input forms such as those for passports, ration cards, voting cards, AADHAR, driving licences, Indian Railway reservations, and census data. This would increase the participation of the Urdu language community in understanding and benefiting from government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata for benchmarking.


Bibliographic details

Source
International Conference on Machine Vision | 2015 | 944526.1-944526.5 | 5 pages
Authors
Prakash Choudhary; Neeta Nain; Mushtaq Ahmed
Original format: PDF
Keywords
Corpus annotation; Demographic data collection; Handwritten document analysis; Urdu Corpus; OCR


Similar literature

1. Nastalique segmentation-based approach for Urdu OCR [J]. Hussain Sarmad, Ali Salman, Akram Qurat ul Ain. International Journal on Document Analysis and Recognition. 2015, No. 4.
2. Salience Analysis of NEWS Corpus using Heuristic Approach in Urdu Language [J]. S. Abbas Ali, M. Daniyal Noor, Munir Ahmed Javed. International Journal of Computer Science and Network Security. 2016, No. 4.
3. Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair [J]. Haneef Israr, Nawab Rao Muhammad Adeel, Munir Ehsan Ullah. Scientific Programming. 2019, No. PTa1.
4. A Unified Approach for Development of Urdu Corpus for OCR and Demographic Purpose [C]. Prakash Choudhary, Neeta Nain, Mushtaq Ahmed. International Conference on Machine Vision. 2015.
5. A multimodal fusion approach for automatic postal address recognition system using Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) techniques [D]. Singh, Amriteshwar. 2011.
6. The Information Product Methods: A Unified Approach to Dual-Purpose Computerized Adaptive Testing [O]. Chanjin Zheng, Guanrui He, Chunlei Gao. 2018.
7. A Segmentation Free Approach to Arabic and Urdu OCR [O]. Nazly Sabbour, Faisal Shafait. 2014.
8. Further Developments in a Hierarchical Bayes Approach to Small Area Estimation of Health Insurance Coverage: State-Level Estimates for Demographic Groups [R]. Bauder, M., Riesz, S., Luery, D. 2009.
