Conference: Document recognition and retrieval XXII

Software Workflow for the Automatic Tagging of Medieval Manuscript Images (SWATI)

Abstract

Digital methods, tools and algorithms are gaining in importance for the analysis of digitized manuscript collections in the arts and humanities. One example is the BMBF-funded research project "eCodicology", which aims to design, evaluate and optimize algorithms for the automatic identification of macro- and micro-structural layout features of medieval manuscripts. The main goal of this research project is to give humanities scholars better insight into high-dimensional datasets of medieval manuscripts. The main challenges in designing a workflow for the arts and humanities are the heterogeneous nature and size of the humanities data, and the need to create a database of automatically extracted, reproducible features for better statistical and visual analysis. This paper presents a concept of a workflow for the automatic tagging of medieval manuscripts. As a starting point, the workflow uses medieval manuscripts digitized within the scope of the project "Virtual Scriptorium St. Matthias". First, these digitized manuscripts are ingested into a data repository. Second, specific algorithms are adapted or designed for the identification of macro- and micro-structural layout elements such as page size, writing space and number of lines. Finally, a statistical analysis and scientific evaluation of the manuscript groups are performed. The workflow is designed generically so that large amounts of data can be processed automatically with any desired feature extraction algorithm. As a result, a database of objectified and reproducible features is created, which helps to analyze and visualize hidden relationships among around 170,000 pages. The workflow shows the potential of automatic image analysis by enabling the processing of a single page in less than a minute. Furthermore, accuracy tests of the workflow on a small set of manuscripts, with respect to features such as page size and text areas, show that automatic and manual analysis are comparable. The use of a computer cluster will allow high-performance processing of large amounts of data. The software framework itself will be integrated as a service into the DARIAH infrastructure to make it adaptable for a wider range of communities.
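
The generic design described in the abstract, where any feature extraction algorithm can be plugged into the same workflow and its output stored per page, could be sketched roughly as follows. This is a minimal illustration only, not the SWATI implementation: the `Page` representation (a grayscale image as nested lists), the function names `page_size`, `count_text_lines` and `run_workflow`, and the ink threshold of 128 are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical page representation: rows of grayscale pixel values
# (0 = dark ink, 255 = blank parchment). Real pages would be image files.
Page = List[List[int]]
Extractor = Callable[[Page], float]


def page_size(page: Page) -> float:
    """Return the page area in pixels (height * width)."""
    return float(len(page) * len(page[0])) if page else 0.0


def count_text_lines(page: Page, threshold: int = 128) -> float:
    """Count runs of consecutive rows that contain at least one dark pixel.

    A deliberately naive stand-in for a line-detection algorithm.
    """
    lines = 0
    in_line = False
    for row in page:
        has_ink = any(pixel < threshold for pixel in row)
        if has_ink and not in_line:
            lines += 1
        in_line = has_ink
    return float(lines)


def run_workflow(
    pages: Dict[str, Page], extractors: Dict[str, Extractor]
) -> Dict[str, Dict[str, float]]:
    """Apply every registered extractor to every page.

    Returns a feature table keyed by page id, then by feature name,
    standing in for the feature database mentioned in the abstract.
    """
    return {
        page_id: {name: extract(page) for name, extract in extractors.items()}
        for page_id, page in pages.items()
    }


if __name__ == "__main__":
    blank = [255] * 4
    toy_page = [blank, [0, 255, 255, 255], blank, [255, 0, 0, 255], blank]
    features = run_workflow(
        {"page-001": toy_page},
        {"page_size": page_size, "line_count": count_text_lines},
    )
    print(features)
```

Because extractors are registered by name in a dictionary, adding a new feature means adding one function and one entry, without touching the loop that walks the pages.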
