Optical Character Recognition for printed Tamil text using Unicode

SEETHALAKSHMI R.; SREERANJANI T.R.; BALACHANDAR T.

Abstract

Optical Character Recognition (OCR) here refers to the process of converting printed Tamil text documents into software-translated Unicode Tamil text. Printed documents available in the form of books, papers, magazines, etc. are scanned using standard scanners, which produce an image of the scanned document. As part of the preprocessing phase, the image file is checked for skew. If the image is skewed, it is corrected by a simple rotation in the appropriate direction. The image is then passed through a noise elimination phase and binarized. The preprocessed image is segmented by an algorithm that decomposes the scanned text into paragraphs using a special space detection technique, the paragraphs into lines using vertical histograms, the lines into words using horizontal histograms, and the words into character image glyphs using horizontal histograms. Each image glyph comprises 32x32 pixels. The segmentation phase thus produces a database of character image glyphs. All the image glyphs are then considered for recognition using Unicode mapping. Each image glyph is passed through various routines that extract its features. The features considered for classification are the character height, the character width, the number of horizontal lines (long and short), the number of vertical lines (long and short), the horizontally oriented curves, the vertically oriented curves, the number of circles, the number of slope lines, the image centroid and special dots. The glyphs are then ready for classification based on these features. The extracted features are passed to a Support Vector Machine (SVM), where the characters are classified by a supervised learning algorithm. The resulting classes are mapped onto Unicode for recognition, and the text is reconstructed using Unicode fonts.
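The segmentation and Unicode-mapping steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an already deskewed, binarized image stored as nested lists (1 = ink, 0 = background), it skips the paragraph and word levels, and the names `profile`, `runs`, `segment`, `labels_to_text` and the three-entry `CLASS_TO_UNICODE` table are hypothetical. Row-wise ink counts are used to find text lines and column-wise counts to find glyphs, which is presumably what the abstract's histograms correspond to. The class labels stand in for the output of a trained classifier such as the SVM.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # binarized image: 1 = ink, 0 = background


def profile(img: Image, axis: str) -> List[int]:
    """Count ink pixels per row (axis='rows') or per column (axis='cols')."""
    if axis == "rows":
        return [sum(row) for row in img]
    return [sum(col) for col in zip(*img)]


def runs(counts: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Half-open (start, end) ranges of non-empty bins.

    Two ranges are kept separate only if at least `min_gap` empty bins
    lie between them; a larger min_gap could be used to split words
    rather than characters.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    gap = 0
    for i, count in enumerate(counts):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(counts) - gap))
    return spans


def segment(img: Image) -> List[Image]:
    """Split an image into lines (row profile), then glyphs (column profile)."""
    glyphs: List[Image] = []
    for top, bottom in runs(profile(img, "rows")):
        line = img[top:bottom]
        for left, right in runs(profile(line, "cols")):
            glyphs.append([row[left:right] for row in line])
    return glyphs


# Hypothetical class table: one glyph class may map to several code points.
CLASS_TO_UNICODE: Dict[int, str] = {
    0: "\u0B85",        # TAMIL LETTER A
    1: "\u0BAE\u0BCD",  # TAMIL LETTER MA + TAMIL SIGN VIRAMA
    2: "\u0BAE\u0BBE",  # TAMIL LETTER MA + TAMIL VOWEL SIGN AA
}


def labels_to_text(labels: List[int]) -> str:
    """Reconstruct Unicode text from a sequence of predicted class labels."""
    return "".join(CLASS_TO_UNICODE[label] for label in labels)
```

In a fuller pipeline each glyph returned by `segment` would be rescaled to 32x32 pixels and reduced to the feature vector listed in the abstract before classification; the sketch leaves those steps out.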

Bibliographic record

Source
Journal of Zhejiang University Science: An international applied physics & engineering journal | 2005, Issue 11 | 9 pages
Authors
SEETHALAKSHMI R.; SREERANJANI T.R.; BALACHANDAR T.
Original format: PDF
Language: English
Chinese Library Classification: General natural science
Keywords
OCR; Unicode; Features; Support Vector Machine (SVM); Artificial Neural Networks
