Conference: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference

Khmer word segmentation based on Bi-directional Maximal Matching for Plaintext and Microsoft Word document


Abstract

A key component of Khmer language processing is transforming Khmer text into a sequence of separate Khmer words. Unlike Latin-script languages such as English or French, Khmer has no explicit word-boundary delimiter, such as a blank space, between words. Moreover, Khmer word formation is more complex: the Khmer Unicode standard ordering of character components permits different orders that lead to the same visual representation, so two words can look identical yet have different character orders. In addition, a Khmer word can be formed by joining two or more Khmer words. These complications pose many challenges for determining word boundaries in Khmer word segmentation. In response to these challenges, and to improve the accuracy and performance of Khmer word segmentation, this paper presents a study of Bi-directional Maximal Matching (BiMM) with Khmer clusters, Khmer Unicode character order correction, corpus list optimization to reduce the frequency of dictionary lookups, and Khmer text manipulation tweaks. The study also addresses how to implement Khmer word segmentation for Khmer content in both plaintext and Microsoft Word documents. For Word documents, the implementation works on both the currently active Word document and a Word document file. The study compares the implementation of BiMM with Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM), and also with a similar algorithm from a previous study. The result is an accuracy of 98.13% with a running time of 2.581 seconds for Khmer content of 1,110,809 characters, which is about 160,000 Khmer words.
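
The abstract names FMM, BMM, and BiMM but does not spell out how they are combined, so the following is only a minimal sketch of the textbook technique, not the paper's implementation. It assumes a plain set of strings as the lexicon, matches on individual code points rather than Khmer clusters, and omits the character order correction and corpus list optimization the paper describes. The function names (`fmm`, `bmm`, `bimm`) and the tie-breaking rule (fewer tokens, then fewer single-character tokens, then the backward result) are assumptions chosen for illustration.

```python
from typing import List, Set


def fmm(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Forward maximal matching: scan left to right, longest match first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan never stalls.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bmm(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Backward maximal matching: scan right to left, longest match first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # tokens were collected from the end of the text
    return words


def count_singles(words: List[str]) -> int:
    """Number of one-character tokens, used as a tie-breaker."""
    return sum(1 for w in words if len(w) == 1)


def bimm(text: str, lexicon: Set[str]) -> List[str]:
    """Run both directions and keep the more plausible segmentation.

    Assumed heuristic: fewer tokens wins; on a tie, fewer single-character
    tokens wins; if still tied, the backward result is returned.
    """
    max_len = max(map(len, lexicon), default=1)
    forward = fmm(text, lexicon, max_len)
    backward = bmm(text, lexicon, max_len)
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    if count_singles(forward) < count_singles(backward):
        return forward
    return backward


# Toy Latin-letter lexicon (hypothetical data, not from the paper).
toy_lexicon = {"ab", "abc", "cd"}
print(bimm("abcd", toy_lexicon))
```

With this toy lexicon, the forward pass yields `['abc', 'd']` and the backward pass yields `['ab', 'cd']`. Both have two tokens, so the tie-breaker prefers the backward result because it contains no single-character token. A Khmer-aware version would iterate over character clusters instead of code points, which is the unit the abstract says the paper uses.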
