Whole-Book Recognition

Abstract

Whole-book recognition is a document image analysis strategy that operates on the complete set of a book's page images, using automatic adaptation to improve accuracy. We describe an algorithm that expects to be initialized with approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally imperfect) dictionaries, and then, guided entirely by evidence internal to the test set, corrects the models, which in turn yields higher recognition accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier, and the linguistic model describes word-occurrence probabilities. Our algorithm detects “disagreements” between these two models by measuring cross entropy between 1) the posterior probability distribution of character classes (the recognition results from image classification alone) and 2) the posterior probability distribution of word classes (the recognition results from image classification combined with linguistic constraints). We show how disagreements can identify candidates for model corrections at both the character and word levels. Some model corrections will reduce the error rate over the whole book, and these can be identified by comparing model disagreements, summed across the whole book, before and after the correction is applied. Experiments on passages up to 180 pages long show that when a candidate model adaptation reduces whole-book disagreement, it is also likely to correct recognition errors. Also, the longer the passage operated on by the algorithm, the more reliable this adaptation policy becomes, and the lower the error rate achieved. The best results occur when the iconic and linguistic models mutually correct one another. We have observed recognition error rates driven down by nearly an order of magnitude fully automatically, without supervision or indeed any user intervention or interaction. Improvement is nearly monotonic, and asymptotic accuracy is stable, even over long runs. If implemented naively, the algorithm runs in time quadratic in the length of the book, but random subsampling and caching techniques speed it up by two orders of magnitude with negligible loss of accuracy. Whole-book recognition has potential applications in digital libraries as a safe unsupervised anytime algorithm.
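The disagreement measure and the accept-if-disagreement-drops policy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes each word image comes with one character posterior per position (iconic model only) and a posterior over candidate words of the same length (iconic plus linguistic), that the word posterior is compared to the character posteriors through its per-position character marginals, and that cross entropy is taken from the character posterior to that marginal. All function names and the toy numbers are hypothetical.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_c p(c) * log q(c); q is floored at eps to avoid log(0)."""
    return -sum(pc * math.log(max(q.get(c, 0.0), eps))
                for c, pc in p.items() if pc > 0)

def char_marginal(word_posterior, position):
    """Character distribution at one position implied by a word posterior.
    Assumes every candidate word has the same length (illustrative only)."""
    marginal = {}
    for word, pw in word_posterior.items():
        ch = word[position]
        marginal[ch] = marginal.get(ch, 0.0) + pw
    return marginal

def word_disagreement(char_posteriors, word_posterior):
    """Sum over positions of the cross entropy between the iconic-only
    character posterior and the word-level character marginal."""
    return sum(cross_entropy(cp, char_marginal(word_posterior, i))
               for i, cp in enumerate(char_posteriors))

def book_disagreement(words):
    """Disagreement summed over every word in the book."""
    return sum(word_disagreement(cp, wp) for cp, wp in words)

def accept_correction(words_before, words_after):
    """Keep a candidate model correction only if it lowers the
    whole-book disagreement."""
    return book_disagreement(words_after) < book_disagreement(words_before)

# Toy one-word "book": the classifier is unsure between 'c' and 'e',
# while the word-level posterior strongly prefers "cat".
word_post = {"cat": 0.9, "eat": 0.1}
before = [([{"c": 0.6, "e": 0.4}, {"a": 1.0}, {"t": 1.0}], word_post)]
# Hypothetical result of adapting the iconic model toward 'c'.
after = [([{"c": 0.9, "e": 0.1}, {"a": 1.0}, {"t": 1.0}], word_post)]

print(book_disagreement(before), book_disagreement(after))
print(accept_correction(before, after))
```

In this toy case the adapted character posterior agrees better with the word-level evidence, so the summed disagreement falls and the correction is kept. Recomputing the sum over every word for every candidate correction is what would make a naive version expensive on a long book.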
