Journal: Digital Scholarship in the Humanities

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Abstract

This article aims to quantify the impact that optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes with some preliminary thoughts on how these findings can be applied to other datasets, reflecting on the potential for predicting the quality of OCR where no ground truth exists.
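
To make the idea of comparing an OCR text with its keyed-in counterpart concrete, here is a minimal Python sketch. It is not the method used in the article, which this abstract does not specify. The function names, the two measures, and the sample strings are assumptions chosen for illustration only; the strings imitate the long-s/f confusion typical of OCR on eighteenth-century print.

```python
from difflib import SequenceMatcher


def char_similarity(ocr_text: str, keyed_text: str) -> float:
    """Return a 0..1 character-level similarity ratio between two strings.

    Illustrative measure only (an assumption, not the article's metric).
    autojunk=False disables difflib's heuristic that ignores very frequent
    characters in long sequences.
    """
    return SequenceMatcher(None, ocr_text, keyed_text, autojunk=False).ratio()


def token_overlap(ocr_text: str, keyed_text: str) -> float:
    """Return the share of keyed-in word types that also occur in the OCR text."""
    ocr_vocab = set(ocr_text.lower().split())
    keyed_vocab = set(keyed_text.lower().split())
    if not keyed_vocab:
        return 1.0  # nothing to recover, so nothing is missing
    return len(ocr_vocab & keyed_vocab) / len(keyed_vocab)


if __name__ == "__main__":
    # Invented sample pair: "same" misread as "fame".
    ocr_sample = "the fame house"
    keyed_sample = "the same house"
    print(char_similarity(ocr_sample, keyed_sample))
    print(token_overlap(ocr_sample, keyed_sample))
```

A character-level ratio is forgiving of single-letter errors, whereas the token-level measure penalises every damaged word in full. This is one reason word-based analyses such as those listed in the abstract may be more sensitive to OCR noise than a character score alone would suggest.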