Source: International Conference on Knowledge and Smart Technology

Longest Matching and Rule-based Techniques for Khmer Word Segmentation


Abstract

Identifying word boundaries is an essential task in natural language processing. Word segmentation has been studied extensively for most Asian languages, including Khmer. For Khmer word segmentation, several dictionary-based approaches have been studied, but only a few studies address the unknown-word problem, which remains a challenging part of the task. This research proposes combining the Maximum Matching algorithm (MMA) with a rule-based technique. First, MMA and a manually prepared Khmer corpus are used to mark word boundaries in each sentence. Unknown words are then identified and resolved using 21 grammar rules created for this purpose. We tested the segmentation on 2018 sentences drawn from agriculture, magazine, newspaper, technology, health, and history texts. Maximum Matching alone achieved an accuracy of 88.55%; adding the rule-based technique increased the accuracy to 92.81%.
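The dictionary-matching stage described in the abstract can be illustrated with a minimal sketch of greedy forward longest matching. This is not the authors' implementation. The function name `max_match`, the toy Latin-script lexicon, and the choice to merge consecutive unmatched characters into one unknown token are all assumptions made for illustration. The paper's 21 grammar rules are not given in the abstract, so they are not reproduced here; the sketch only flags the unknown spans that such rules would then have to resolve.

```python
from typing import List, Set, Tuple


def max_match(text: str, lexicon: Set[str]) -> List[Tuple[str, bool]]:
    """Greedy forward longest matching (illustrative sketch).

    Returns (token, is_known) pairs. Consecutive characters that start no
    lexicon entry are merged into one unknown token (an assumption of this
    sketch, not a rule taken from the paper).
    """
    longest = max((len(word) for word in lexicon), default=1)
    tokens: List[Tuple[str, bool]] = []
    unknown = ""  # buffer of consecutive characters with no lexicon match
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon hits.
        match = ""
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            if unknown:  # flush any pending unknown span first
                tokens.append((unknown, False))
                unknown = ""
            tokens.append((match, True))
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.append((unknown, False))
    return tokens


# Hypothetical toy lexicon in Latin script, used only to keep the example readable.
LEXICON = {"the", "them", "me", "cat", "sat"}

print(max_match("themecatxyzsat", LEXICON))
# [('them', True), ('e', False), ('cat', True), ('xyz', False), ('sat', True)]
```

In this toy run the greedy choice of "them" strands the following "e", and "xyz" matches nothing in the lexicon. Spans flagged `False` like these are the kind of leftover that a second, rule-based pass would need to handle. A real Khmer segmenter would probably also need to match over character clusters rather than single code points, which this sketch does not attempt.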
