Conference: Document Recognition and Retrieval XII

A Study of Style Effects on OCR Errors in the MEDLINE Database

Abstract

The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process performs well overall, it does make character recognition errors that must be corrected for the process to achieve the requisite accuracy. The correction process feeds words containing characters with less than 100% confidence (as determined automatically by the OCR engine) to a human operator, who must then manually verify the word or correct the error. The majority of these errors occur in the affiliation information zone, where the characters are in italics or small fonts; therefore, only affiliation data is used in this research. This paper examines the correlation between OCR errors and various character attributes in the MEDLINE database, such as font size, italics, and bold, as well as OCR confidence levels. The motivation for this research is that if a correlation exists between character style and types of errors, it should be possible to use this information to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. We have determined that this correlation exists, in particular for characters with diacritics.
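The review rule and the style analysis described in the abstract can be illustrated with a minimal Python sketch. This is not the NLM system's code: the `Char` record, the function names, and the sample data are all hypothetical, confidence is assumed to be reported per character on a 0.0 to 1.0 scale, and italics stands in for the wider set of style attributes the paper considers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Char:
    """Hypothetical record for one OCR'd character (not the NLM data model)."""
    text: str                    # character as output by the OCR engine
    confidence: float            # assumed engine confidence in [0.0, 1.0]
    italic: bool = False         # one example style attribute
    truth: Optional[str] = None  # operator-verified character, if known


def needs_review(word):
    """A word goes to the human operator if any character is below 100% confidence."""
    return any(c.confidence < 1.0 for c in word)


def error_rate_by_style(chars):
    """Fraction of misrecognized characters, grouped by the italic flag."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for c in chars:
        if c.truth is None:      # skip characters without verified ground truth
            continue
        totals[c.italic] += 1
        if c.text != c.truth:
            errors[c.italic] += 1
    return {style: errors[style] / totals[style] for style in totals}


# Made-up sample: one upright word, one italic word with a missed diacritic.
words = [
    [Char("D", 1.0, truth="D"), Char("r", 1.0, truth="r")],
    [Char("M", 1.0, italic=True, truth="M"),
     Char("u", 0.62, italic=True, truth="ü")],
]

review_queue = [w for w in words if needs_review(w)]
rates = error_rate_by_style([c for w in words for c in w])
```

With this sample, only the italic word enters `review_queue`, and `rates` maps each style flag to its observed error fraction. A real analysis would need far more data and a statistical test before claiming a correlation.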
