Conference: International Conference on Knowledge and Smart Technology

Longest Matching and Rule-based Techniques for Khmer Word Segmentation

Abstract

Identifying word boundaries is an essential task in natural language processing research. Word segmentation has been studied extensively for most Asian languages, including Khmer. For Khmer word segmentation, several dictionary-based approaches have been studied, but only a few studies address the unknown word problem, which remains a challenging task in word separation. This research proposes a Maximum Matching algorithm (MMA) combined with a rule-based technique. First, MMA and a manual Khmer corpus were used to mark word boundaries in each sentence. The unknown words were then identified and resolved using 21 grammar rules created for this purpose. We tested the segmentation on 2018 sentences from agriculture, magazine, newspaper, technology, health and history texts. Maximum Matching alone achieved an accuracy of 88.55%; combined with the rule-based technique, the accuracy increased to 92.81%.
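The longest-matching step described in the abstract can be illustrated with a minimal sketch of greedy forward maximum matching. This is not the authors' implementation: the function name `max_match`, the toy Latin-script dictionary, and the choice to group unmatched characters into a single unknown span are all assumptions made for illustration, and the paper's 21 grammar rules are not reproduced because the abstract does not list them.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward longest matching; returns (token, is_known) pairs.

    Hypothetical helper for illustration only, not the paper's code.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    unknown = []  # buffer for characters where no dictionary word starts
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # No dictionary word starts here: extend the unknown span.
            unknown.append(text[i])
            i += 1
            continue
        if unknown:
            # Flush the pending unknown span before the known word.
            tokens.append(("".join(unknown), False))
            unknown = []
        tokens.append((match, True))
        i += len(match)
    if unknown:
        tokens.append(("".join(unknown), False))
    return tokens


# Toy dictionary (an assumption; a real system would load a Khmer lexicon).
toy_dictionary = {"the", "them", "theme", "me", "park", "is", "open"}
result = max_match("themeparkxyzisopen", toy_dictionary)
for token, is_known in result:
    print(token, "known" if is_known else "UNKNOWN")
```

In this sketch the span flagged as unknown is where a second, rule-based pass of the kind the abstract describes would take over; the greedy matcher itself only decides boundaries for words already present in the dictionary.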
