This article aims to quantify the impact that optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of analyses common in the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes with some preliminary thoughts on how these findings can be applied to other datasets, reflecting on the potential for predicting OCR quality where no ground truth exists.