Seventh International Workshop on Health Text Mining and Information Analysis

Low-resource OCR error detection and correction in French Clinical Texts

Abstract

In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. Traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information, or manually corrected training material, but these are not always available given the constraints placed on using medical corpora. We therefore propose a novel method that needs only a representative corpus of acceptable OCR quality in order to train models. Our method uses recurrent neural networks (RNNs) to model sequential information at the character level for a given medical text corpus. By inserting noise during the training process, we can simultaneously learn the underlying character-level language model and learn to detect and eliminate random noise in the textual input. The resulting models are robust to the variability of OCR quality and do not require additional external information such as lexicons. We compare two different ways of injecting noise into the training process and evaluate our models on a manually corrected data set. The best-performing system achieves 73% accuracy.
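
The idea of inserting noise during training can be illustrated with a minimal sketch. The abstract does not say which two noise-injection schemes the authors compare, so everything below is an assumption made for illustration: the function names `inject_noise` and `make_training_pairs`, the three edit operations (substitute, delete, insert), and the per-character noise rate are all hypothetical. The sketch only builds (noisy input, clean target) pairs that a character-level model could be trained on; it does not include the RNN itself.

```python
import random


def inject_noise(text, rate, alphabet, rng):
    """Return a corrupted copy of `text` (hypothetical helper).

    Each character is, with probability `rate`, substituted by a random
    character, deleted, or followed by an inserted random character.
    A substitution may happen to draw the original character again.
    """
    out = []
    for ch in text:
        if rng.random() < rate:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(alphabet))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(alphabet))
            # "delete": append nothing for this character
        else:
            out.append(ch)
    return "".join(out)


def make_training_pairs(lines, rate=0.1, seed=0):
    """Build (noisy, clean) pairs from lines of acceptable OCR quality.

    The alphabet is taken from the corpus itself, so no external lexicon
    is needed. `rate` and `seed` are illustrative defaults, not values
    reported by the authors.
    """
    rng = random.Random(seed)
    alphabet = sorted(set("".join(lines)))
    return [(inject_noise(line, rate, alphabet, rng), line) for line in lines]


if __name__ == "__main__":
    corpus = ["examen du placenta", "poids du foetus"]
    for noisy, clean in make_training_pairs(corpus, rate=0.2, seed=42):
        print(repr(noisy), "->", repr(clean))
```

Under this assumed setup, a sequence model would receive the noisy string as input and the clean string as its target, which is one plausible way to obtain correction training data without manually corrected material.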