Unicode Canonical Decomposition for Hangeul Syllables in Regular Expression

Hee Yuan TAN; Hyotaek LIM


Abstract

Owing to their high expressiveness, regular expressions are frequently used for searching and manipulating text-based data. Regular expressions work well for text written in the Latin alphabet, but the same cannot be said for Hangeul, the writing system of the Korean language. Although Hangeul has alphabetic features, the expressiveness of regular expression patterns that use Hangeul is hindered by the absence of syllable decomposition. Without decomposition support in the regular expression engine, searching Hangeul text is limited to string literal matching. Literal matching makes it necessary to enumerate syllable candidates in the pattern definition, which is impractical, especially for a large set of candidates. The existing canonical decomposition in the Unicode standard does reduce a precomposed Hangeul syllable into smaller units of consonant-vowel or consonant-vowel-consonant letters, but it still leaves many of the individual letters in compound form. We observe that the compound letters need to be reduced further into basic letters to represent the Korean script properly in regular expressions. We look at how the new canonical decomposition technique proposed by Kim can help in handling Hangeul in regular expressions. In this paper, we examine several performance indicators of full decomposition of Hangeul syllables to better understand the overhead that might be incurred if full decomposition were implemented in a regular expression engine. For efficiency, we propose a semi-decomposition technique together with a notation for defining Hangeul syllables. Semi-decomposition enhances the existing regular expression syntax by incorporating some of the special constructs and features of the Korean language. The proposed technique is intended to give end users greater freedom in defining regular expression syntax for Hangeul.
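The gap described in the abstract can be illustrated with a minimal Python sketch. It relies only on the standard library: `unicodedata.normalize("NFD", ...)` performs the existing Unicode canonical decomposition into conjoining jamo. The `COMPOUND_JAMO` table, the function names, and the regular expression pattern are assumptions made for this illustration. The table covers only three compound letters and is not the mapping proposed by Kim or the notation proposed in the paper, neither of which is given in the abstract.

```python
import re
import unicodedata

# ASSUMPTION: a tiny, incomplete table written for illustration only.
# It splits three compound conjoining jamo into basic letters and is
# NOT the decomposition table from the paper or from Kim's proposal.
COMPOUND_JAMO = {
    "\u1101": "\u1100\u1100",  # choseong ssang-kiyeok -> kiyeok + kiyeok
    "\u116A": "\u1169\u1161",  # jungseong wa -> o + a
    "\u11B0": "\u11AF\u11A8",  # jongseong rieul-kiyeok -> rieul + kiyeok
}


def canonical_decompose(text: str) -> str:
    """Standard Unicode canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)


def full_decompose(text: str) -> str:
    """NFD followed by splitting the compound jamo listed in the table."""
    return "".join(COMPOUND_JAMO.get(ch, ch) for ch in canonical_decompose(text))


# Over NFD text, one short pattern covers every syllable whose initial
# consonant is kiyeok (U+1100): initial, any vowel, optional final consonant.
INITIAL_KIYEOK = re.compile("\u1100[\u1161-\u1175][\u11A8-\u11C2]?")


def syllables_with_initial_kiyeok(text: str) -> list[str]:
    """Find matching syllables and recompose them (NFC) for display."""
    decomposed = canonical_decompose(text)
    return [unicodedata.normalize("NFC", m) for m in INITIAL_KIYEOK.findall(decomposed)]


if __name__ == "__main__":
    # NFD leaves the compound vowel wa (U+116A) as a single letter.
    print([hex(ord(c)) for c in canonical_decompose("과")])  # ['0x1100', '0x116a']
    # The illustrative table splits it into o + a.
    print([hex(ord(c)) for c in full_decompose("과")])  # ['0x1100', '0x1169', '0x1161']
    print(syllables_with_initial_kiyeok("한국 가다"))  # ['국', '가']
```

Without decomposition, the same search over precomposed text would have to enumerate every syllable that begins with kiyeok, or fall back on code point ranges. The sketch also shows why NFD alone may not be enough: a pattern written in terms of the basic vowels o and a cannot match 과 unless the compound vowel is split further.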


Bibliographic details

Source: IEICE Transactions on Information and Systems, 2011, No. 1, 9 pages
Authors: Hee Yuan TAN; Hyotaek LIM
Original format: PDF
Chinese Library Classification: Radio electronics; telecommunications technology
Added to database: 2022-08-18 08:35:30
