Conference: International Conference on Asian Language Processing

Information extraction and text mining of Ancient Vattezhuthu characters in historical documents using image zoning

Abstract

The aim of this paper is to develop a system that recognizes Brahmi, Grantha and Vattezhuthu characters in palm-leaf manuscripts of ancient Tamil historical documents, analyzes the text, and machine-translates it into present-day Tamil digital text. Although many researchers have implemented algorithms and techniques for character recognition in different languages, converting ancient characters remains a major challenge. Image recognition technology has reached near-perfection for scanning text in English and some other languages, but optical character recognition (OCR) software capable of digitizing even printed Tamil text with high accuracy is still elusive. Only a few people are familiar with the ancient characters, and they attempt to convert them into written documents manually. The proposed system addresses this by converting ancient historical documents, from both inscriptions and palm-leaf manuscripts, into Tamil digital text encoded in Tamil Unicode. Our algorithm comprises four stages: i) image preprocessing, ii) feature extraction, iii) character recognition and iv) digital text conversion. In the first phase, the conversion accuracy of our algorithm for the Brahmi script is 91.57%, using a neural network and the image zoning method. The second phase targets the Vattezhuthu character set, for which the conversion accuracy is 89.75%.
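The abstract names image zoning as the feature-extraction step and Tamil Unicode as the output encoding, but gives no implementation details. The following is a minimal sketch of how those two steps might look, not the authors' method. It assumes a binarized glyph image stored as rows of 0/1 pixels, a zone grid that divides the image evenly, and ink density as the per-zone feature. The function names `zoning_features` and `to_tamil_text` and the table `CLASS_TO_TAMIL` are hypothetical, and the classifier (a neural network in the paper) is omitted.

```python
from typing import Dict, Iterable, List, Sequence


def zoning_features(
    image: Sequence[Sequence[int]], rows: int = 4, cols: int = 4
) -> List[float]:
    """Split a binary glyph image into rows x cols zones and return the
    fraction of foreground (1) pixels in each zone, in row-major order."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        # Assumption of this sketch: the grid divides the image evenly.
        raise ValueError("image size must be divisible by the zone grid")
    zone_h, zone_w = height // rows, width // cols
    features: List[float] = []
    for zr in range(rows):
        for zc in range(cols):
            ink = sum(
                image[y][x]
                for y in range(zr * zone_h, (zr + 1) * zone_h)
                for x in range(zc * zone_w, (zc + 1) * zone_w)
            )
            features.append(ink / (zone_h * zone_w))
    return features


# Hypothetical mapping from recognized class index to a modern Tamil
# code point; a real table would cover the full character set.
CLASS_TO_TAMIL: Dict[int, str] = {
    0: "\u0B85",  # TAMIL LETTER A
    1: "\u0B95",  # TAMIL LETTER KA
    2: "\u0BAE",  # TAMIL LETTER MA
}


def to_tamil_text(class_ids: Iterable[int]) -> str:
    """Join the Tamil Unicode characters for a sequence of class indices."""
    return "".join(CLASS_TO_TAMIL[i] for i in class_ids)


# Toy 4x4 glyph split into a 2x2 grid of zones.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
features = zoning_features(glyph, rows=2, cols=2)
print(features)  # [1.0, 0.0, 0.0, 0.25]
```

In a full pipeline the feature vector would be passed to the trained classifier, and the predicted class indices would then be handed to `to_tamil_text` to produce the digital text.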
