From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.


Abstract

Much electronic text in the languages of South Asia has been published on the Internet. However, while Unicode has emerged as the favoured encoding system of corpus and computational linguists, most South Asian language data on the web uses one of a wide range of non-standard legacy encodings. This paper describes the difficulties inherent in converting text in these encodings to Unicode. Among the various legacy encodings for South Asian scripts, the most problematic are 8-bit fonts based on graphical principles (as opposed to the logical principles of Unicode). Graphical fonts typically encode several features in ways highly incompatible with Unicode. For instance, half-form glyphs used to construct conjunct consonants are typically separate code points in 8-bit fonts; in Unicode they are represented by the full consonant followed by virama. There are many more such cases. The solution described here is an approach to text conversion based on mapping rules. A small number of generalised rules (plus the capacity for more specialised rules) captures the behaviour of each character in a font, building up a conversion algorithm for that encoding. This system is embedded in a font-mapping program, outputting CES-compliant SGML Unicode. This program, a generalised text-conversion tool, has been employed extensively in corpus-building for South Asian languages.
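
The half-form case above can be illustrated with a minimal sketch of rule-based conversion. This is not the program the abstract describes. The byte values, the `LEGACY_RULES` table, and the `convert` function are all hypothetical, invented for illustration, and correspond to no real font. Only the Unicode code points for Devanagari are real. Each legacy code point is assigned one of two generalised rules: `direct` (emit the mapped character) or `half` (emit the full consonant followed by virama).

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# Hypothetical legacy table: byte value -> (rule kind, Unicode character).
# The byte assignments are assumptions and match no real 8-bit font.
LEGACY_RULES = {
    0x6B: ("direct", "\u0915"),  # full consonant KA
    0x4B: ("half", "\u0915"),    # half-form glyph of KA
    0x73: ("direct", "\u0938"),  # full consonant SA
    0x53: ("half", "\u0938"),    # half-form glyph of SA
    0x79: ("direct", "\u092F"),  # full consonant YA
    0x61: ("direct", "\u093E"),  # vowel sign AA
}


def convert(data: bytes) -> str:
    """Convert text in the hypothetical legacy encoding to Unicode."""
    out = []
    for byte in data:
        if byte not in LEGACY_RULES:
            # A real converter would need a more specialised rule here.
            raise ValueError(f"no mapping rule for byte 0x{byte:02X}")
        kind, char = LEGACY_RULES[byte]
        if kind == "half":
            # Graphical half-form -> logical "full consonant + virama".
            out.append(char + VIRAMA)
        else:
            out.append(char)
    return "".join(out)


# Half-form KA, full YA, vowel sign AA -> KA + VIRAMA + YA + AA
print(convert(bytes([0x4B, 0x79, 0x61])))
```

A table-driven design of this kind keeps the font-specific knowledge in data rather than in code, so supporting another encoding would mostly mean writing another table. Cases that need context, which the abstract alludes to with "more specialised rules", would require additional rule kinds beyond the two shown.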

Bibliographic record

  • Author: Hardie, Andrew
  • Year: 2007
  • Original format: PDF
  • Indexed: 2022-08-20 20:14:21
