Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition


Abstract

In this study, we developed an Isarn Dharma word segmentation system, focusing on the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous words occur frequently in Isarn Dharma because the script is written without tone markers, so the same written form can be read with different tones and meanings. To overcome these problems, we developed an Isarn Dharma character cluster (IDCC)-based statistical model with affixation and integrated it with named entity recognition (NER); the combined method is called the IDCC-C-based statistical model and affixation with NER. This method combines IDCC-based and character-based statistical models to identify word boundaries. The IDCC-based statistical model uses IDCC features to disambiguate ambiguous words, while the character-based statistical model handles unknown words using character features. In addition, linguistic knowledge is employed to detect the boundaries of new words based on construction morphology and NER. In evaluations, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods as the corpus size increased. On the test set, the proposed method obtained the best F-measure of 92.19, which was 2.85 higher than that of IDCC longest matching grouping.
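To make the longest matching baseline and the F-measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes a greedy left-to-right longest matching segmenter over a plain character string (the paper groups IDCC units, which are not reproduced here) and a word-level F-measure computed from matching word spans. The function names, the toy Latin-script lexicon, and the 0 to 1 score scale are all illustrative assumptions; the abstract reports its scores on a 0 to 100 scale.

```python
from typing import List, Set, Tuple


def longest_matching(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy left-to-right longest matching (illustrative baseline).

    At each position, take the longest lexicon entry that matches;
    if none matches, emit a single character as its own token.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    out: Set[Tuple[int, int]] = set()
    start = 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def f_measure(predicted: List[str], reference: List[str]) -> float:
    """Word-level F-measure: harmonic mean of precision and recall,
    where a predicted word is correct only if its span matches exactly."""
    pred, ref = spans(predicted), spans(reference)
    correct = len(pred & ref)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy lexicon; "themeet" is ambiguous between
    # "the" + "meet" and a greedy match starting with "them".
    lexicon = {"the", "them", "me", "men", "meet"}
    predicted = longest_matching("themeet", lexicon)
    print(predicted)
    print(f_measure(predicted, ["the", "meet"]))
```

In this toy case the greedy match commits to "them" and cannot recover "meet", which is the kind of boundary ambiguity that motivates adding a statistical model on top of a purely dictionary-driven baseline.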
