Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Hill Mark J.; Hengchen Simon

Abstract

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.

Publication details

Source
Digital Scholarship in the Humanities | 2019, Issue 4 | pp. 825-843 | 19 pages
Authors
Hill Mark J.; Hengchen Simon
Author affiliation

Univ Helsinki, Dept Digital Humanities, COMHIS, Helsinki, Finland

Original format: PDF
Language: English
Keywords
COLLOCATIONS
