Published in: The Journal of Documentation

Cultural heritage as digital noise: nineteenth century newspapers in the digital archive



Abstract

Purpose: The purpose of this paper is to explore and analyze the digitized newspaper collection at the National Library of Sweden, focusing on cultural heritage as digital noise. In what specific ways are newspapers transformed in the digitization process? If the digitized document is not the same as the source document, is it still a historical record, or is it transformed into something else?

Design/methodology/approach: The authors analyzed the XML files of Aftonbladet from 1830 to 1862. The most frequent newspaper words not matching a high-quality reference corpus were selected to zoom in on the noisiest part of the paper. The variety of the interpretations generated by optical character recognition (OCR) was examined, as well as texts generated by auto-segmentation. The authors also conducted a limited ethnographic study of the digitization process.

Findings: The research shows that the digital collection of Aftonbladet contains extreme amounts of noise: millions of misinterpreted words generated by OCR, and millions of texts re-edited by the auto-segmentation tool. How the tools work is mostly unknown to the staff involved in the digitization process. Sticking to any idea of a provenance chain is hence impossible, since many steps have been outsourced to unknown factors affecting the source document.

Originality/value: The detailed examination of digitally transformed newspapers is valuable to scholars who depend on newspaper databases in their research. The paper also highlights the fact that libraries outsourcing digitization processes run the risk of losing control over the quality of their collections.
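The word-selection step described under Design/methodology/approach (picking the most frequent newspaper words that do not match a reference corpus) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `noisiest_tokens`, the simple regex tokenizer, and the toy Swedish-like data are all assumptions made for illustration only.

```python
import re
from collections import Counter


def noisiest_tokens(ocr_text, reference_vocab, top_n=10):
    """Return the top_n most frequent tokens absent from reference_vocab.

    Hypothetical helper: tokenization is a plain lowercase word split,
    which is far cruder than what a real study would need.
    """
    tokens = re.findall(r"\w+", ocr_text.lower())
    # Count only tokens that the reference vocabulary does not contain.
    counts = Counter(t for t in tokens if t not in reference_vocab)
    return counts.most_common(top_n)


# Invented toy data: a tiny "reference corpus" vocabulary and a short
# OCR-like string containing two misread forms ("tidniug", "ocb").
reference_vocab = {"och", "det", "är", "en", "tidning"}
ocr_text = "och det är en tidning tidniug tidniug ocb och"

print(noisiest_tokens(ocr_text, reference_vocab, top_n=2))
```

On real data the reference vocabulary would come from a curated corpus, and the unmatched high-frequency tokens would be candidates for closer inspection rather than confirmed OCR errors, since rare names and period spellings would also fail to match.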
