Published in: The Semantic Web: Research and Applications (conference proceedings)
Creating Digital Resources from Legacy Documents: An Experience Report from the Biosystematics Domain


Abstract

Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world they describe. A prerequisite is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand, and manual cleaning and correction on the other, should be tightly interleaved, and that tools supporting this integration yield a significant improvement.
