A Unified Approach for Development of Urdu Corpus for OCR and Demographic Purpose


Abstract

This paper presents a methodology for developing an Urdu handwritten text image corpus and for applying corpus linguistics to OCR and information retrieval from handwritten documents. Compared to other scripts, Urdu is somewhat complicated for data entry: entering a single character can require a combination of multiple keystrokes. A mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would simplify the manual data processing currently involved in data collection from input forms such as Passport, Ration Card, Voting Card, AADHAR, Driving licence, Indian Railway Reservation, and Census forms. This would increase the participation of the Urdu language community in understanding and benefiting from Government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata information for benchmarking.


Bibliographic details

Source: International Conference on Machine Vision, 2015, 5 pages
Authors: Prakash Choudhary; Neeta Nain; Mushtaq Ahmed
Original format: PDF
Chinese Library Classification: N-532
Keywords: Corpus annotation; Demographic data collection; Handwritten document analysis; Urdu Corpus; OCR


Similar literature


1. Nastalique segmentation-based approach for Urdu OCR [J]. Hussain Sarmad, Ali Salman, Akram Qurat ul Ain. International Journal on Document Analysis and Recognition, 2015, No. 4.
2. Salience Analysis of NEWS Corpus using Heuristic Approach in Urdu Language [J]. S. Abbas Ali, M. Daniyal Noor, Munir Ahmed Javed. International Journal of Computer Science and Network Security, 2016, No. 4.
3. Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair [J]. Haneef Israr, Nawab Rao Muhammad Adeel, Munir Ehsan Ullah. Scientific Programming, 2019, No. PTa1.
4. A Unified Approach for Development of Urdu Corpus for OCR and Demographic Purpose [C]. Prakash Choudhary, Neeta Nain, Mushtaq Ahmed. International Conference on Machine Vision, 2015.
5. A multimodal fusion approach for automatic postal address recognition system using Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) techniques [D]. Singh, Amriteshwar. 2011.
6. The Information Product Methods: A Unified Approach to Dual-Purpose Computerized Adaptive Testing [O]. Chanjin Zheng, Guanrui He, Chunlei Gao. 2018.
7. A Segmentation Free Approach to Arabic and Urdu OCR [O]. Nazly Sabbour, Faisal Shafait. 2014.
8. Further Developments in a Hierarchical Bayes Approach to Small Area Estimation of Health Insurance Coverage: State-Level Estimates for Demographic Groups [R]. Bauder, M., Riesz, S., Luery, D. 2009.
