Conference: ACM Symposium on Document Engineering

Document Conversion for Cultural Heritage Texts: FrameMaker to HTML Revisited

Abstract

Many large-scale digitization projects are currently under way that intend to preserve the cultural heritage contained in paper documents (in particular books) and make it available on the Web. Typically, OCR is used to produce searchable electronic texts from books. For newer books, approximately from the late 1980s onwards, digital text may already exist in the form of typesetting data. For applications that require a higher level of accuracy than OCR can deliver, the conversion of typesetting data can thus be an alternative to manual keying. In this paper, we describe a tool for converting typesetting data in FrameMaker format to XHTML+CSS, developed for a collection of source editions of medieval and early modern documents. Even though the books of the Collection are typeset in good quality and in modern typefaces, OCR is unusable, since the text is in various historical forms of German, French, Italian, Rhaeto-Romanic, and Latin. The conversion of typesetting data produces fully reliable text free from OCR errors and thus also provides a basis for the construction of language resources for the processing of historical texts.
