Journal: Digital Scholarship in the Humanities

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Abstract

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.
