Khmer word segmentation based on Bi-directional Maximal Matching for Plaintext and Microsoft Word document


Abstract

A key component of Khmer language processing is transforming Khmer text into a series of separate Khmer words. Unlike Latin-script languages such as English or French, Khmer has no explicit word boundary delimiter, such as a blank space, to separate words. Moreover, Khmer words have a more complex structure, and the Khmer Unicode standard ordering of character components permits different orders that lead to the same visual representation: words that look identical but have a different character order. In addition, a Khmer word can be a compound of two or more Khmer words. These complications pose many challenges for determining word boundaries in Khmer word segmentation. In response to these challenges, and to improve the accuracy and performance of Khmer word segmentation, this paper presents a study of Bi-directional Maximal Matching (BiMM) with Khmer Clusters, Khmer Unicode character order correction, corpus list optimization to reduce the frequency of dictionary lookups, and Khmer text manipulation tweaks. The study also covers how to implement Khmer word segmentation for Khmer content in both plaintext and Microsoft Word documents. For Word documents, the implementation works on the currently active Word document and on Word document files. The study compares the implementation of BiMM with Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM), and with a similar algorithm from a previous study. The reported result is 98.13% accuracy with a running time of 2.581 seconds for Khmer content of 1,110,809 characters, which is about 160,000 Khmer words.
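
To make the three matching strategies named in the abstract concrete, the following is a minimal sketch of the generic FMM, BMM, and BiMM scheme, not the authors' implementation. Several things here are assumptions made for illustration: the function names (`fmm`, `bmm`, `bimm`), the toy Latin-letter lexicon, matching on raw characters instead of Khmer Clusters, and the rule used to choose between the forward and backward results (fewest words, then fewest unknown words, then fewest single-character words, with ties going to the backward result). The paper's own selection rule, cluster handling, and character order correction are not described in the abstract and are not reproduced here.

```python
from typing import List, Set


def fmm(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Forward Maximal Matching: scan left to right, longest match first."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bmm(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Backward Maximal Matching: scan right to left, longest match first."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def bimm(text: str, lexicon: Set[str]) -> List[str]:
    """Bi-directional Maximal Matching: run both scans and keep the better one.

    The scoring rule below is an assumed heuristic, not the paper's rule.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    forward = fmm(text, lexicon, max_len)
    backward = bmm(text, lexicon, max_len)
    if forward == backward:
        return forward

    def score(segmentation: List[str]):
        unknown = sum(1 for w in segmentation if w not in lexicon)
        singles = sum(1 for w in segmentation if len(w) == 1)
        return (len(segmentation), unknown, singles)

    # min() keeps the first argument on ties, so ties go to the backward result.
    return min(backward, forward, key=score)


# Toy lexicon in Latin letters (hypothetical, for readability only).
toy_lexicon = {"go", "got", "to", "top"}
print(fmm("gotop", toy_lexicon, 3))   # ['got', 'o', 'p']
print(bmm("gotop", toy_lexicon, 3))   # ['go', 'top']
print(bimm("gotop", toy_lexicon))     # ['go', 'top']
```

In this toy input the forward scan greedily takes "got" and is left with two unmatched characters, while the backward scan finds two dictionary words, so the bidirectional step keeps the backward result. A Khmer implementation along the lines of the abstract would step over Khmer Clusters instead of single code points.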

Bibliographic details

Source
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2014 | 9 pages
Authors
Bi Narin; Taing Nguonly
Original format
PDF
Chinese Library Classification
Computer applications
Keywords
natural language processing; pattern matching; text analysis; word processing; BMM; BiMM; FMM; Khmer Unicode character order correction; Khmer clusters; Khmer language processing; Khmer word segmentation; Microsoft word document; backward maximal matching; bi-directional maximal matching; corpus list optimization; forward maximal matching; plaintext document; Abstracts; Accuracy; Bidirectional control; Decision support systems; Standards; Transforms; Visualization; Backward Maximal Matching; Bi-directional Maximal Matching; Forward Maximal Matching; Khmer Cluster; Khmer Unicode; Word Segmentation;