Optical Character Recognition for printed Tamil text using Unicode

Abstract

Optical Character Recognition (OCR) here refers to the process of converting printed Tamil text documents into software-generated Unicode Tamil text. Printed documents such as books, papers, and magazines are scanned with standard scanners, which produce an image of each document. In the preprocessing phase, the image is checked for skew; if it is skewed, it is corrected by a simple rotation in the appropriate direction. The image then passes through a noise-elimination phase and is binarized. The preprocessed image is segmented by an algorithm that decomposes the scanned text into paragraphs using a special space-detection technique, the paragraphs into lines using vertical histograms, the lines into words using horizontal histograms, and the words into character image glyphs using horizontal histograms. Each image glyph consists of 32x32 pixels. The segmentation phase thus produces a database of character image glyphs. All image glyphs are then considered for recognition using Unicode mapping. Each glyph is passed through routines that extract its features. The features considered for classification are the character height, the character width, the number of horizontal lines (long and short), the number of vertical lines (long and short), the horizontally oriented curves, the vertically oriented curves, the number of circles, the number of sloped lines, the image centroid, and special dots. The glyphs are then ready for classification based on these features. The extracted features are passed to a Support Vector Machine (SVM), which classifies the characters using a supervised learning algorithm. The resulting classes are mapped onto Unicode for recognition, and the text is then reconstructed using Unicode fonts.
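
The histogram-based segmentation and the final Unicode mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the page is already deskewed and binarized (1 = ink, 0 = background), it skips the paragraph and word levels, and it omits the feature extraction and SVM stages entirely. The abstract's "vertical" and "horizontal" histogram naming is ambiguous, so the sketch simply counts ink per row to find text lines and ink per column to find glyphs. The function names and the class-label numbering in CLASS_TO_UNICODE are hypothetical; only the Tamil code points themselves come from the Unicode standard.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # binarized image: 1 = ink, 0 = background


def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # a blank gap ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the image edge
    return runs


def row_profile(img: Image) -> List[int]:
    """Ink count per row; blank rows separate text lines."""
    return [sum(row) for row in img]


def column_profile(img: Image) -> List[int]:
    """Ink count per column; blank columns separate glyphs."""
    return [sum(col) for col in zip(*img)]


def segment_glyphs(img: Image) -> List[Image]:
    """Split a page into lines, then each line into glyph sub-images."""
    glyphs: List[Image] = []
    for top, bottom in runs_of_ink(row_profile(img)):
        line = img[top:bottom]
        for left, right in runs_of_ink(column_profile(line)):
            glyphs.append([row[left:right] for row in line])
    return glyphs


# Hypothetical class labels, as a classifier might emit them, mapped to
# Tamil code points (Tamil block: U+0B80 to U+0BFF).
CLASS_TO_UNICODE: Dict[int, str] = {
    0: "\u0B85",  # TAMIL LETTER A
    1: "\u0B86",  # TAMIL LETTER AA
    2: "\u0B95",  # TAMIL LETTER KA
}


def labels_to_text(labels: List[int]) -> str:
    """Reconstruct text from a sequence of predicted class labels."""
    return "".join(CLASS_TO_UNICODE[label] for label in labels)


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
    ]
    for glyph in segment_glyphs(page):
        print(glyph)
    print(labels_to_text([2, 0]))
```

In a fuller pipeline, each segmented glyph would be scaled to the 32x32 size mentioned in the abstract and reduced to a feature vector before classification. The gap-based splitting shown here also assumes glyphs are separated by at least one fully blank row or column, which touching or overlapping characters would violate.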