Information extraction and text mining of Ancient Vattezhuthu characters in historical documents using image zoning



Abstract

The aim of this paper is to develop a system that recognizes Brahmi, Grantha and Vattezhuthu characters in palm-leaf manuscripts of ancient Tamil historical documents, analyzes the text, and machine-translates it into present-day Tamil in digital text format. Although many researchers have implemented algorithms and techniques for character recognition in different languages, converting ancient characters remains a major challenge. Image recognition technology has reached near-perfection for scanned text in English and some other languages, but optical character recognition (OCR) software that digitizes printed Tamil text with high accuracy is still elusive. Only a few people are familiar with the ancient characters, and they convert them into written documents manually. The proposed system addresses this by converting ancient historical documents, from both inscriptions and palm manuscripts, into Tamil digital text encoded in Tamil Unicode. Our algorithm comprises four stages: i) image preprocessing, ii) feature extraction, iii) character recognition and iv) digital text conversion. In the first phase, the algorithm converts Brahmi script with an accuracy of 91.57% using a neural network and the image zoning method. The second phase, covering the Vattezhuthu character set, is to be implemented; the conversion accuracy reported for Vattezhuthu is 89.75%.
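The abstract names image zoning as the feature extraction method but does not give its details. As a rough illustration only, the sketch below shows one common form of zoning: a binarized glyph image is split into a grid of zones, and the share of foreground pixels in each zone becomes one feature. The function name `zoning_features`, the grid size, and the toy glyph are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def zoning_features(image: List[List[int]], rows: int, cols: int) -> List[float]:
    """Split a binary glyph image into rows x cols zones and return the
    fraction of foreground (1) pixels in each zone, in row-major order.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("image must be non-empty")
    if not (1 <= rows <= height and 1 <= cols <= width):
        raise ValueError("grid must fit inside the image")

    features = []
    for r in range(rows):
        # Integer boundaries keep zones contiguous even when the image
        # size is not an exact multiple of the grid size.
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            ink = sum(image[y][x]
                      for y in range(top, bottom)
                      for x in range(left, right))
            area = (bottom - top) * (right - left)
            features.append(ink / area)
    return features


# Toy 4x4 glyph (assumed data): a vertical stroke on the left
# and a horizontal bar along the bottom.
glyph = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
features = zoning_features(glyph, rows=2, cols=2)
print(features)  # [0.5, 0.0, 0.75, 0.5]
```

In a pipeline like the one described, a vector of this kind would be passed to the character recognition stage (a neural network in the paper); the classifier and the mapping from recognized classes to Tamil Unicode are outside this sketch.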


Bibliographic details

Source
International Conference on Asian Language Processing | 2016 | pp. 37-40 | 4 pages
Authors
E.K. Vellingiriraj; M. Balamurugan; P. Balasubramanie
Original format
PDF
Keywords
Character recognition; Image segmentation; Feature extraction; Training; Databases; Image recognition; Cryptography

Similar literature

1. Word Extraction and Character Segmentation from Text Lines of Unconstrained Handwritten Bangla Document Images [J]. Ram Sarkar, Samir Malakar, Nibaran Das. Journal of Intelligent Systems. 2011, No. 3
2. Text line extraction for historical document images [J]. Raid Saabni, Abedelkadir Asi, Jihad El-Sana. Pattern Recognition Letters. 2014
3. Text Extraction from Historical Document Images by the Combination of Several Thresholding Techniques [J]. Toufik Sari, Abderrahmane Kefali, Halima Bahi. Advances in Multimedia. 2014
4. Information extraction and text mining of Ancient Vattezhuthu characters in historical documents using image zoning [C]. E.K. Vellingiriraj, M. Balamurugan, P. Balasubramanie. International Conference on Asian Language Processing. 2016
5. Extraction of Text Objects in Image and Video Documents [D]. Zhang, Jing. 2012
6. Text Extraction from Scene Images by Character Appearance and Structure Modeling [O]. Chucai Yi, Yingli Tian
7. Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning [O]. E. K. Vellingiriraj, M. Balamurugan, P. Balasubramanie. 2016
