IEICE Transactions on Information and Systems

Unicode Canonical Decomposition for Hangeul Syllables in Regular Expression


Abstract

Owing to their high expressiveness, regular expressions are frequently used for searching and manipulating text-based data. Regular expressions work well for text based on the Latin alphabet, but the same cannot be said for Hangeul, the writing system of the Korean language. Although Hangeul has alphabetic features, the expressiveness of regular expression patterns over Hangeul is hindered by the absence of syllable decomposition. Without decomposition support in the regular expression engine, searching Hangeul text is limited to string literal matching. Literal matching makes enumerating syllable candidates in a pattern definition indispensable, yet impractical, especially for a large set of candidates. Although the existing canonical decomposition in the Unicode Standard does reduce a precomposed Hangeul syllable into smaller units of consonant-vowel or consonant-vowel-consonant letters, it still leaves many of the individual letters in compound form. We observe that the compound letters must be further reduced to basic letters to represent the Korean script properly in regular expressions. We look at how the new canonical decomposition technique proposed by Kim can help in handling Hangeul in regular expressions. In this paper, we examine several performance indicators of full decomposition of Hangeul syllables to better understand the overhead that might be incurred if full decomposition were implemented in a regular expression engine. For efficiency, we propose a semi-decomposition technique together with a notation for defining Hangeul syllables. Semi-decomposition enhances the existing regular expression syntax by incorporating some of the special constructs and features of the Korean language. The proposed technique is intended to give end users greater freedom in defining regular expressions for Hangeul.
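The gap between standard canonical decomposition and a decomposition into basic letters can be illustrated with a minimal Python sketch. It is not the paper's implementation: it relies only on the standard `unicodedata` and `re` modules, and the helper names (`to_jamo`, `to_basic_letters`, `find_kiyeok_syllables`) and the two-entry `COMPOUND_TO_BASIC` table are hypothetical, chosen for illustration and not taken from Kim's proposal.

```python
import re
import unicodedata


def to_jamo(text: str) -> str:
    """Standard canonical decomposition (NFD): each precomposed syllable
    becomes leading consonant + vowel (+ optional trailing consonant)."""
    return unicodedata.normalize("NFD", text)


# Hypothetical and deliberately partial table, for illustration only.
# It is NOT the mapping proposed by Kim; a real table would cover every
# compound vowel and consonant.
COMPOUND_TO_BASIC = {
    "\u116a": "\u1169\u1161",  # vowel WA -> O + A
    "\u11b9": "\u11b8\u11ba",  # trailing PIEUP-SIOS -> PIEUP + SIOS
}


def to_basic_letters(text: str) -> str:
    """Apply NFD, then split the compound letters listed in the table."""
    return "".join(COMPOUND_TO_BASIC.get(ch, ch) for ch in to_jamo(text))


# Over NFD text a single pattern can describe a whole family of syllables:
# leading KIYEOK (U+1100), any vowel, and an optional trailing consonant.
# Over precomposed text the same family would have to be enumerated.
KIYEOK_SYLLABLE = re.compile("\u1100[\u1161-\u1175][\u11a8-\u11c2]?")


def find_kiyeok_syllables(text: str) -> list[str]:
    """Return the syllables in `text` that begin with KIYEOK, recomposed."""
    return [
        unicodedata.normalize("NFC", m.group())
        for m in KIYEOK_SYLLABLE.finditer(to_jamo(text))
    ]


syllable = "\uac12"  # one precomposed syllable whose trailing consonant is compound
print(len(to_jamo(syllable)))           # NFD keeps the compound trailing letter
print(len(to_basic_letters(syllable)))  # the sketch splits it into two letters
```

In this sketch NFD alone leaves the trailing PIEUP-SIOS as one code point, so a pattern that asks for a trailing PIEUP would not match it; the extra table lookup stands in for the kind of further reduction the abstract argues for.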
