Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Abstract

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
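
To make the OCR-noise problem concrete, the following is a minimal sketch, not taken from the paper, of how the amount of character-level corruption in an OCR output can be quantified as a character error rate (CER) against a clean reference. The function names `levenshtein` and `char_error_rate` and the example strings are assumptions chosen for illustration only; the paper's own evaluation setup is not described in this record.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the reference length (hypothetical helper)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only: an entity mention and a plausible OCR misreading.
    clean = "Berlin"
    noisy = "Ber1in"
    print(f"CER = {char_error_rate(clean, noisy):.3f}")
```

Even a single substituted character changes the surface form of an entity mention, which is one way the misspellings described in the abstract can make a mention harder for an NER model to match.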

Bibliographic record

Source
Conference on Computational Natural Language Learning | 2020 | pp. 431-441 | 11 pages
Authors
Emanuela Boros; Ahmed Hamdi; Elvys Linhares Pontes; Luis Adrian Cabrera-Diego; Jose G. Moreno; Nicolas Sidere; Antoine Doucet
Original format: PDF

Similar literature

1. Recognition-based Segmentation for Digitization of Korean Historical Document Pages [J] . Kyu-Tae Cho, Jin-Hyung Kim 電子情報通信学会技術研究報告. パターン認識·メディア理解. Pattern Recognition and Media Understanding . 2006, no. 376
2. Named Entity Recognition in Vietnamese documents based on CRF [J] . Vo Trung Hung American Journal of Engineering Research . 2020, no. 5
3. Named entity recognition for Chinese judgment documents based on BiLSTM and CRF [J] . Wenming Huang, Dengrui Hu, Zhenrong Deng, EURASIP journal on image and video processing . 2020, no. 1
4. Named Entity Recognition of Indian Origin Names in English Documents [C] . Chaitanya Gupta, Deepanshu Sood, Mahua Bhattacharya Proceedings of the 2013 international conference on information & knowledge engineering . 2013
5. Named entity recognition and an application to document clustering [D] . Wei, Gang 2004
6. Comparison of named entity recognition methodologies in biomedical documents [O] . Hye-Jeong Song, Byeong-Cheol Jo, Chan-Young Park, 2018
7. A Method to Detect Errors in Electronic Discharge Summaries Based on Named Entity Recognition [O] . D.S. Yuan, T.S. Zhou, Y. Tian, 2015