Conference: International Conference on Asian Language Processing

Information extraction and text mining of Ancient Vattezhuthu characters in historical documents using image zoning

Abstract

The aim of this paper is to develop a system that recognizes Brahmi, Grantha and Vattezhuthu characters in palm-leaf manuscripts of ancient Tamil historical documents, analyzes the text, and machine-translates it into present-day Tamil digital text. Although many researchers have implemented algorithms and techniques for character recognition in different languages, converting ancient characters remains a major challenge. Image recognition technology has reached near-perfection for scanned English text and text in some other languages, but optical character recognition (OCR) software capable of digitizing printed Tamil text with high accuracy is still elusive. Only a few people are familiar with the ancient characters, and they attempt to convert them into written documents manually. The proposed system addresses this by converting ancient historical documents from inscriptions and palm-leaf manuscripts into Tamil digital text, encoded in Tamil Unicode. The algorithm comprises four stages: i) image preprocessing, ii) feature extraction, iii) character recognition and iv) digital text conversion. In the first phase, the conversion accuracy for the Brahmi script is 91.57% using a neural network and the image zoning method. The second phase, covering the Vattezhuthu character set, is to be implemented; the conversion accuracy reported for Vattezhuthu is 89.75%.
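The feature extraction and digital text conversion stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the zone grid, the per-zone feature, or the label set, so the sketch assumes a binarized glyph image, uses ink density per zone (one common zoning feature), and maps recognizer output labels to modern Tamil code points through a hypothetical table `LABEL_TO_TAMIL`. The neural network classifier itself is left out.

```python
from typing import List

Image = List[List[int]]  # binarized glyph: 1 = ink, 0 = background


def zoning_features(image: Image, rows: int, cols: int) -> List[float]:
    """Return the ink density of each zone in a rows x cols grid, row by row."""
    height, width = len(image), len(image[0])
    features = []
    for r in range(rows):
        # Integer boundaries so the zones tile the image without gaps.
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            ink = sum(image[y][x]
                      for y in range(top, bottom)
                      for x in range(left, right))
            area = (bottom - top) * (right - left)
            features.append(ink / area if area else 0.0)
    return features


# Hypothetical label table (an assumption, not from the paper):
# recognizer class index -> modern Tamil letter in the Unicode Tamil block.
LABEL_TO_TAMIL = {
    0: "\u0b85",  # TAMIL LETTER A
    1: "\u0b95",  # TAMIL LETTER KA
    2: "\u0bae",  # TAMIL LETTER MA
}


def to_tamil_text(labels: List[int]) -> str:
    """Join the Tamil letters for a sequence of recognized class labels."""
    return "".join(LABEL_TO_TAMIL[label] for label in labels)


glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
print(zoning_features(glyph, 2, 2))  # [1.0, 0.0, 0.0, 0.75]
print(to_tamil_text([1, 2]))
```

In a full pipeline, the feature vector from `zoning_features` would be the input to the character classifier, and the predicted labels would be passed to `to_tamil_text`. A finer grid gives the classifier more spatial detail at the cost of a longer feature vector.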