A search engine for historical manuscript images

Abstract

Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We show experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Experiments show that the precision at 20 documents is about 0.4 to 0.5 depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.
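
The abstract does not specify either model, so the following is only a toy sketch of the general idea: learn co-occurrence statistics between image features and transcribed words from labeled data, rank unlabeled pages for a text query, and score the ranking with precision at k. The discrete feature tokens, the add-alpha smoothing, the max-over-word-images page score, and all function names are assumptions made for illustration; they are not the authors' models or data.

```python
from collections import Counter, defaultdict


def train_joint_counts(pairs):
    """Count co-occurrences of feature tokens and transcribed words.

    pairs: iterable of (feature_tokens, word), one per transcribed word image.
    Feature tokens are a hypothetical discretization of image features.
    """
    joint = defaultdict(Counter)  # joint[token][word] -> co-occurrence count
    for tokens, word in pairs:
        for token in tokens:
            joint[token][word] += 1
    return joint


def word_given_image(joint, tokens, word, vocab_size, alpha=1.0):
    """Estimate P(word | image) as the mean of add-alpha smoothed P(word | token)."""
    if not tokens:
        return 0.0
    total_prob = 0.0
    for token in tokens:
        counts = joint.get(token, Counter())
        seen = sum(counts.values())
        total_prob += (counts[word] + alpha) / (seen + alpha * vocab_size)
    return total_prob / len(tokens)


def rank_pages(joint, pages, query, vocab_size):
    """Rank page ids by the best-matching word image on each page (an assumed rule)."""
    scores = {
        page_id: max(
            (word_given_image(joint, tokens, query, vocab_size) for tokens in images),
            default=0.0,
        )
        for page_id, images in pages.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


# Toy data invented for illustration: two training word images, two test pages.
training = [(["a", "b"], "letter"), (["c", "d"], "orders")]
model = train_joint_counts(training)
test_pages = {"page1": [["a", "b"]], "page2": [["c", "d"]]}

ranking = rank_pages(model, test_pages, "letter", vocab_size=2)
print(ranking)
print(precision_at_k(ranking, {"page1"}, k=1))
```

With k=20, `precision_at_k` corresponds to the "precision at 20 documents" measure quoted in the abstract: the share of the 20 top-ranked pages that are actually relevant to the query.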
