Conference: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference

Support Vector Machine (SVM) based classifier for Khmer Printed Character-set Recognition

Abstract

This paper describes the use of a Support Vector Machine (SVM) based classification method for Khmer Printed Character-set Recognition (PCR) in bitmap documents. Khmer has been identified as one of the most complex languages, with a total of 74 letters and compound clusters that can stack up to 5 vertical levels. The paper proposes an SVM-based Khmer character classification system and compares 3 SVM kernels (Gaussian, polynomial, and linear) in training and recognition to find the best kernel for Khmer. The method works with a small training dataset because it trains on individual pieces of characters rather than on a large number of whole clusters. The classifier uses binary data, with 0 for white space and 1 for the black pixel area of the character; each training piece is stretched into a matrix of this binary data, whatever the size of the source image. Features are extracted from the matrix for use in SVM classification. After recognition, a set of rules based on character levels and common-mistake correction combines the recognized pieces into clusters or characters. An experiment on about 750 pure Khmer words (around 3000 characters) shows that the SVM with a Gaussian kernel produces good results and outperforms the other kernels. The system is trained on a single font, "Khmer OS Content", at 32pt and is used to recognize text at 3 different font sizes. The accuracy is 98.17% at 28pt, 98.62% at 32pt, and 98.54% at 36pt.
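The preprocessing and the three kernels named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the matrix size, the feature extraction step, or any kernel parameters, so the function names, the threshold, the grid size, and the `gamma`, `degree`, and `coef0` values below are all assumptions chosen for illustration, and the flattened pixel matrix stands in for the unspecified features.

```python
import math


def binarize(gray, threshold=128):
    """Map grayscale rows (0-255) to 0 = white space, 1 = black pixel.
    The threshold of 128 is an assumption, not a value from the paper."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def stretch(bitmap, size=8):
    """Nearest-neighbour resample of a binary bitmap to a size x size grid,
    so pieces cut from images of any size share one matrix shape."""
    h, w = len(bitmap), len(bitmap[0])
    return [[bitmap[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def flatten(matrix):
    """Turn the binary matrix into a feature vector (a stand-in for the
    paper's feature extraction, which the abstract does not describe)."""
    return [float(v) for row in matrix for v in row]


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree (parameters are assumed)."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.1):
    """K(x, y) = exp(-gamma * ||x - y||^2) (gamma is assumed)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two toy 2x2 "glyph pieces": one diagonal stroke and its inverse.
glyph = binarize([[0, 255], [255, 0]])
inverse = binarize([[255, 0], [0, 255]])
x = flatten(stretch(glyph, size=4))
y = flatten(stretch(inverse, size=4))

for name, kernel in [("linear", linear_kernel),
                     ("polynomial", polynomial_kernel),
                     ("gaussian", gaussian_kernel)]:
    print(name, kernel(x, x), kernel(x, y))
```

An SVM trained with one of these kernels only ever sees the kernel values between pairs of vectors, which is why the same binary matrices can be reused unchanged when comparing the three kernels.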
