Conference: 4th Workshop on Analytics for Noisy Unstructured Text Data (2010)

Extracting Person Names from Diverse and Noisy OCR Text

Abstract

Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
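The majority-vote ensemble and the quality comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say at what granularity the votes are taken or which metrics are used, so everything below is an assumption: votes are counted per token, each extractor is represented as a set of token indices it labels as part of a person name, and quality is measured with precision, recall, and F1. The function names, the toy tokens, and the extractor outputs are hypothetical.

```python
from collections import Counter


def majority_vote(predictions, min_votes=None):
    """Return the token indices labeled as a name by a majority of extractors.

    predictions: list of sets of token indices, one set per extractor.
    min_votes: votes required; defaults to a strict majority (2 of 3).
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    votes = Counter(idx for pred in predictions for idx in pred)
    return {idx for idx, count in votes.items() if count >= min_votes}


def precision_recall_f1(predicted, reference):
    """Token-level precision, recall, and F1 of predicted against reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical OCR tokens ("Jchn" stands in for a misrecognized "John").
tokens = ["Mr.", "Jchn", "Smith", "of", "Provo", "married", "Mary", "Jones"]

# Hypothetical outputs of three extractors, as sets of token indices.
extractor_a = {1, 2, 6, 7}
extractor_b = {2, 4, 6, 7}  # misses the OCR-damaged "Jchn", adds "Provo"
extractor_c = {1, 2, 6}     # misses "Jones"

# Hypothetical reference annotation made by a human.
reference = {1, 2, 6, 7}

ensemble = majority_vote([extractor_a, extractor_b, extractor_c])

for name, predicted in [("b", extractor_b), ("c", extractor_c), ("ensemble", ensemble)]:
    p, r, f = precision_recall_f1(predicted, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the vote drops the token that only one extractor proposed and recovers the tokens that a single extractor missed. To compare against the two references described in the abstract, the same scoring function could be run twice, once with the names a human extracted from the OCR output and once with the names extracted from the original page images.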
