Venue: Conference on Computational Natural Language Learning

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Abstract

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
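To make the OCR challenge concrete, the following minimal sketch simulates character-level digitization noise and shows how it defeats a naive exact-match entity lookup. It is an illustration only and is not the model proposed in the paper: the confusion table `OCR_CONFUSIONS`, the helpers `add_ocr_noise`, `char_error_rate`, and `exact_match_entities`, and the example sentence are all hypothetical names and data chosen for this sketch.

```python
import random

# Hypothetical character confusions loosely resembling OCR mistakes.
# These pairs are illustrative assumptions, not taken from the paper.
OCR_CONFUSIONS = {"l": "1", "o": "0", "e": "c", "s": "f", "i": "l"}


def add_ocr_noise(text, rate, seed=0):
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    out = []
    for ch in text:
        sub = OCR_CONFUSIONS.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        return 0.0
    return levenshtein(reference, hypothesis) / len(reference)


def exact_match_entities(text, gazetteer):
    """Naive baseline: return tokens that appear verbatim in a name list."""
    stripped = (token.strip(".,") for token in text.split())
    return [token for token in stripped if token in gazetteer]


if __name__ == "__main__":
    clean = "Napoleon arrived in Berlin."
    names = {"Napoleon", "Berlin"}
    for rate in (0.0, 0.3, 1.0):
        noisy = add_ocr_noise(clean, rate)
        print(rate, repr(noisy), char_error_rate(clean, noisy),
              exact_match_entities(noisy, names))
```

Because a single substituted character is enough to break a verbatim lookup, an NER system for digitized text needs representations that tolerate such spelling variation, which is the kind of robustness the abstract is concerned with.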
