Conference: IST archiving conference

The digital documents quality control workflow at the BnF (operation, issue, improvement)

Abstract

Gallica, the digital library of the French national library, is one of the most important digital libraries in the world. Today Gallica holds more than three million digital documents. These documents are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. The robustness of OCR systems is not always guaranteed: their outputs contain defects both in the detection of the document structure (segmentation results) and in the character recognition results. One frequent error in OCR outputs is missed text components. Such defects may lead to inconsistencies in digital libraries. In this paper we present the current document quality control workflow. We then propose a method that aims to detect missed text components without using ground truth. Our verification method uses local information inside pages, based on Radon transform descriptors and Local Binary Patterns (LBP), to verify their OCR quality. The proposed approach shows good performance: it detects 84.15% of missed text components with a precision of 94.73%.
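
The abstract does not say how the two descriptor families are computed, so the following is only an illustrative sketch of the general techniques it names, not the authors' implementation. It assumes a grayscale or binarized page patch given as a 2-D NumPy array, uses the basic 8-neighbour LBP variant, and approximates Radon projections by summing pixels along rows, columns, and diagonals. The function names (`lbp_histogram`, `projection_profiles`, `describe_patch`) are hypothetical.

```python
import numpy as np


def lbp_histogram(patch):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes.

    Assumption: the patch is at least 3x3; border pixels get no code.
    """
    patch = np.asarray(patch, dtype=np.float64)
    h, w = patch.shape
    center = patch[1:-1, 1:-1]
    # Neighbour offsets, clockwise from top-left; each one sets one bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = patch[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()


def projection_profiles(patch):
    """Discrete Radon-style projections: row, column, and diagonal sums."""
    patch = np.asarray(patch, dtype=np.float64)
    h, w = patch.shape
    rows = patch.sum(axis=1)   # one value per row
    cols = patch.sum(axis=0)   # one value per column
    # One value per diagonal, from the bottom-left to the top-right corner.
    diags = np.array([np.trace(patch, offset=k) for k in range(-(h - 1), w)])
    return rows, cols, diags


def describe_patch(patch):
    """Concatenate the LBP histogram with mass-normalized projections."""
    rows, cols, diags = projection_profiles(patch)
    total = rows.sum()
    scale = total if total > 0 else 1.0   # avoid dividing by zero on blank patches
    return np.concatenate([lbp_histogram(patch), rows / scale,
                           cols / scale, diags / scale])
```

In a verification setting of the kind the abstract describes, such a vector could be computed for page regions that the OCR output left unlabelled and passed to a classifier that decides whether the region looks like text. That classifier, and the choice of regions, are assumptions here and are not specified in the abstract.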