Extracting Person Names from Diverse and Noisy OCR Text

Abstract

Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
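The majority-vote ensemble and the quality metrics mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each of the three extractors emits one name/non-name decision per OCR token and that quality is scored as token-level precision, recall, and F1; the abstract does not state the granularity or the exact metrics the authors used. The function names, the sample tokens, and the extractor outputs below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def majority_vote(predictions: Sequence[Sequence[bool]]) -> List[bool]:
    """Mark a token as part of a name when more than half of the extractors do."""
    if not predictions:
        raise ValueError("need at least one extractor output")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all extractors must label the same tokens")
    # zip(*predictions) yields, for each token, the votes of every extractor.
    return [sum(votes) * 2 > len(votes) for votes in zip(*predictions)]


def precision_recall_f1(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 against a reference labeling."""
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if r and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical noisy OCR tokens and hypothetical outputs of three extractors.
tokens = ["Mr.", "J0hn", "Srnith", "of", "Provo"]
extractor_a = [False, True, True, False, False]
extractor_b = [False, True, False, False, True]
extractor_c = [True, True, True, False, False]
reference = [False, True, True, False, False]  # hand-labeled for this example

ensemble = majority_vote([extractor_a, extractor_b, extractor_c])
ensemble_scores = precision_recall_f1(ensemble, reference)
extractor_b_scores = precision_recall_f1(extractor_b, reference)
```

In this toy input the vote discards the errors that only one extractor makes, which is the intuition behind combining extractors; it says nothing about the size of the improvement reported in the paper.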


Bibliographic record

Source
4th Workshop on Analytics for Noisy Unstructured Text Data (2010), pp. 19-26 (8 pages)
Conference location: Toronto, Canada
Authors
Thomas L. Packer; Joshua F. Lutes; Aaron P. Stewart; David W. Embley; Eric K. Ringger; Kevin D. Seppi; Lee S. Jensen
Author affiliations

Department of Computer Science, Brigham Young University, Provo, Utah, USA
Department of Computer Science, Brigham Young University, Provo, Utah, USA
Department of Computer Science, Brigham Young University, Provo, Utah, USA
Department of Computer Science, Brigham Young University, Provo, Utah, USA
Department of Computer Science, Brigham Young University, Provo, Utah, USA
Department of Computer Science, Brigham Young University, Provo, Utah, USA
Ancestry.com, Inc., Provo, Utah, USA
Original format: PDF
Language: English
Chinese Library Classification: Information processing
Keywords
information extraction; noisy OCR; named entity recognition; NER; scanned document images; MEMM; CRF

Date added to database: 2022-08-26 14:23:23

