Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Hill Mark J.; Hengchen Simon

Abstract

This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.

Bibliographic record

Source: Digital scholarship in the humanities, 2019, Issue 4, pp. 825-843 (19 pages)
Authors: Hill Mark J.; Hengchen Simon
Affiliations: Univ Helsinki Dept Digital Humanities COMHIS Helsinki Finland (both authors)
Original format: PDF
Language: English
