International Conference on Document Analysis and Recognition

Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing



Abstract

Textual access to large collections of digitized images remains infeasible because they usually lack transcripts, and transcribing such collections is in turn typically unaffordable. However, probabilistic indices can enable textual access with only moderate resource demands. Besides supporting effortless information retrieval, probabilistic indices can also be used, as we show, to estimate textual features of indexed but otherwise untranscribed collections, such as the number of running words and Zipf's curves. Complete probabilistic indices have recently been produced for two iconic large collections: "Bentham" (90K images) and "Spanish Golden Age Theater" (40K images). To show the impact of making these collections searchable, we provide access statistics gathered through their corresponding search interfaces. To the best of our knowledge, this is the first publication of large collections of untranscribed manuscripts that are publicly available for effective and efficient textual access.
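
The abstract does not spell out how textual features are estimated from a probabilistic index. As a rough illustration only, the sketch below assumes that each index entry pairs a candidate word with a relevance probability and that summing those probabilities gives an expected count. The entry layout, the sample data, and all function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical index entries: (image_id, candidate_word, relevance_probability).
# These values are made up for illustration; they are not from either collection.
index = [
    ("img1", "law", 0.9),
    ("img1", "low", 0.1),
    ("img1", "the", 0.8),
    ("img2", "the", 0.7),
    ("img2", "law", 0.5),
]

def expected_running_words(entries):
    """Assumed estimator: sum of relevance probabilities over all entries."""
    return sum(p for _, _, p in entries)

def expected_word_counts(entries):
    """Assumed estimator: per-word expected frequency (sum of probabilities)."""
    counts = defaultdict(float)
    for _, word, p in entries:
        counts[word] += p
    return dict(counts)

def zipf_points(counts):
    """Return (rank, expected frequency) pairs, most frequent word first."""
    ranked = sorted(counts.values(), reverse=True)
    return list(enumerate(ranked, start=1))

running_words = expected_running_words(index)
word_counts = expected_word_counts(index)
curve = zipf_points(word_counts)
```

Plotting `curve` on log-log axes would give an estimated Zipf curve under these assumptions, without any transcript being required.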
