Journal: International Journal on Document Analysis and Recognition

User-configurable OCR Enhancement for Online Natural History Archives


Abstract

The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and this demand is met by neither off-the-shelf OCR nor commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (butterflies and moths) and Coleoptera (beetles) to searchable digital archives. Evaluation results covering different aspects of the system, from card scanning to overall word recognition rates for different database fields, are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70-90% are reported for key data fields, subject to the availability of suitable electronic dictionaries. Further validation and correction are supported through web-editing of the online digital archive.
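The abstract notes that recognition rates depend on suitable electronic dictionaries but does not say how the dictionaries are applied. As a rough illustration only, the sketch below shows one generic form of dictionary-based post-correction, using approximate string matching from Python's standard `difflib` module. The function name `correct_field`, the `cutoff` value, and the sample genus list are all assumptions made for this example and are not taken from the paper.

```python
import difflib


def correct_field(ocr_text, dictionary, cutoff=0.8):
    """Replace each OCR token with its closest dictionary entry.

    Hypothetical helper: tokens with no entry at or above `cutoff`
    similarity are left unchanged so a human editor can review them.
    """
    # Map lower-cased entries back to their canonical spelling.
    lookup = {entry.lower(): entry for entry in dictionary}
    candidates = list(lookup)
    corrected = []
    for token in ocr_text.split():
        matches = difflib.get_close_matches(
            token.lower(), candidates, n=1, cutoff=cutoff
        )
        corrected.append(lookup[matches[0]] if matches else token)
    return " ".join(corrected)


# Illustrative dictionary of genus names (assumed, not from the paper).
genus_dictionary = ["Papilio", "Pieris", "Vanessa"]
print(correct_field("Papi1io", genus_dictionary))
```

In this sketch a higher `cutoff` leaves more tokens uncorrected, while a lower one risks replacing a valid but unlisted name with a wrong dictionary entry, which is one reason a manual validation step such as the web-editing mentioned above remains useful.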