Conference: International Conference on Signal Image Technology Internet Based Systems
A Framework for Compilation of Multi-lingual Handwritten Database: Four Levels XML Ground-Truth

Abstract

This paper presents a semi-automatic framework for annotating document images of multi-lingual handwritten text. There is a significant need for a structure that can record the segmentation coordinates of the text in a handwritten document image and so provide a platform for evaluating OCR algorithms. We describe an XML-based, four-level annotation of handwritten text images that contains the ground-truth text of the script in Unicode. To help collect large amounts of data for linguistic researchers, the structure stores annotations at four levels (image, lines, words, and characters), which aids the benchmarking of various OCR systems. The structure is intended as a source for compiling an annotated handwritten corpus in a systematic and scientific way: it stores the labelling (markup) of the script text in Unicode, in an XML file that encapsulates the bounding-box pixel information of each level in a collaborative manner. The resulting annotations support various quantitative and statistical corpus-based approaches to linguistic analysis.
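
The four-level layout described in the abstract can be pictured with a small sketch. The abstract does not give the actual schema, so the element names (`Image`, `Line`, `Word`, `Character`), the attribute names (`x`, `y`, `width`, `height`, `text`), the coordinates, and the sample word are all assumptions chosen for illustration. The sketch only shows one plausible way to nest bounding boxes and Unicode ground truth using Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical level names; the paper's real XML schema is not given in the abstract.
LEVELS = ("Image", "Line", "Word", "Character")


def region(parent, tag, bbox, text):
    """Create one annotated region: a pixel bounding box plus its Unicode text."""
    x, y, w, h = bbox
    attrs = {"x": str(x), "y": str(y), "width": str(w), "height": str(h), "text": text}
    if parent is None:
        return ET.Element(tag, attrs)       # root element (the image level)
    return ET.SubElement(parent, tag, attrs)  # nested level


def build_example():
    """Build a tiny annotation tree for an invented one-word page."""
    word_text = "कमल"  # sample Devanagari word, three code points
    image = region(None, "Image", (0, 0, 1200, 1600), word_text)
    line = region(image, "Line", (100, 200, 300, 80), word_text)
    word = region(line, "Word", (100, 200, 300, 80), word_text)
    # One Character element per code point, each with its own (made-up) box.
    for i, ch in enumerate(word_text):
        region(word, "Character", (100 + 100 * i, 200, 100, 80), ch)
    return image


def level_counts(root):
    """Count the annotated regions at each of the four levels."""
    return {tag: len(list(root.iter(tag))) for tag in LEVELS}


if __name__ == "__main__":
    root = build_example()
    print(ET.tostring(root, encoding="unicode"))
    print(level_counts(root))
```

Because every level carries both a bounding box and a transcription, an OCR system's output could in principle be compared against the ground truth at whichever granularity it produces: whole lines, words, or single characters.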