Cross-validation comparison of NIST OCR databases

Conference: Character Recognition Technologies

Abstract

[...]lity of reference databases for optical character recognition is vital to the meaningful assessment of classification algorithms. NIST has produced two databases of segmented handprinted characters obtained from socially distinct writer populations. Two approaches to comparing the databases are described. The first uses the eigenvalue spectrum of the covariance matrix as an a priori measure of the variance intrinsic to the data. The second cross-validates the datasets, using classification error to quantify the difficulty of OCR. The eigenvalue spectra of the training partitions are generated during the production of the Karhunen-Loève (KL) transforms, the leading components of which are used as prototype features for a classifier. The eigenspectra are used to quantify the diversity of the character sets, and the Bhattacharyya distance is used to measure class separability. The digits, uppercase letters and lowercase letters from the two populations of 500 writers are partitioned into N disjoint sets. The KL transforms of each such set are used for testing, while the remaining N-1 sets form the training prototypes for a PNN nearest neighbor classifier. Recognition error rates and their variances are calculated over the N partitions for both databases independently. This quantifies intra-database diversity. The inter-database results, or "cross" terms, obtained by training and testing on different databases, indicate the generality of the training set. The results for digits suggest that the second NIST database (used nominally for testing) is significantly harder than the first (training) set; the testing images are 11% more variant. The NIST training data classifies partitions of itself with 1.7% error, and the test set with 6.8% error. Conversely, the test set generalizes to both itself and the training data with 3.5% error. This effect has also been reported using non-NIST classifiers.
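The procedure outlined in the abstract (eigenspectrum of the covariance matrix, KL features, N-way cross-validation, and inter-database "cross" terms) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: a plain 1-nearest-neighbor rule stands in for the PNN classifier, samples are split at random rather than by writer, and the data are synthetic Gaussian blobs rather than NIST images. All function names (`kl_basis`, `nn_classify`, `cross_validate`, `cross_error`, `make_db`) are hypothetical.

```python
import numpy as np

def kl_basis(X, k):
    """Mean, top-k KL (PCA) basis vectors, and the full eigenvalue
    spectrum (descending) of the covariance of X (rows are samples)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)        # eigh returns ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    return mean, evecs[:, :k], evals

def nn_classify(train_f, train_y, test_f):
    """1-nearest-neighbor labels (assumption: stands in for the PNN)."""
    d2 = ((test_f[:, None, :] - train_f[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]

def cross_error(X_tr, y_tr, X_te, y_te, k):
    """Error rate when training on one set and testing on another."""
    mean, basis, _ = kl_basis(X_tr, k)        # basis from training data only
    pred = nn_classify((X_tr - mean) @ basis, y_tr, (X_te - mean) @ basis)
    return float(np.mean(pred != y_te))

def cross_validate(X, y, n_parts, k, rng):
    """Mean and variance of the error over n_parts disjoint partitions."""
    folds = np.array_split(rng.permutation(len(X)), n_parts)
    errs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        errs.append(cross_error(X[train_idx], y[train_idx],
                                X[test_idx], y[test_idx], k))
    return float(np.mean(errs)), float(np.var(errs))

def make_db(rng, noise, n=200, d=16):
    """Synthetic two-class 'database' (assumption: not NIST data)."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(scale=noise, size=(n, d)) + 10.0 * y[:, None]
    return X, y

rng = np.random.default_rng(0)
Xa, ya = make_db(rng, noise=1.0)              # stand-in for the first database
Xb, yb = make_db(rng, noise=1.5)              # noisier stand-in for the second

# A priori comparison: total variance is the sum of the eigenvalue spectrum.
var_a = kl_basis(Xa, 4)[2].sum()
var_b = kl_basis(Xb, 4)[2].sum()
print("relative variance, second vs first:", var_b / var_a)

# Intra-database error (mean, variance over partitions) and the cross term.
print("intra-database:", cross_validate(Xa, ya, 5, 4, rng))
print("cross term (train on first, test on second):",
      cross_error(Xa, ya, Xb, yb, 4))
```

Computing the KL basis from the training partition alone, and projecting the held-out partition with that same basis, keeps test data out of feature construction; whether the original study did exactly this is not stated in the abstract.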
