
Splitting compounds with ngrams


Abstract

Compound words with unmarked word boundaries are problematic for many tasks in NLP and computational linguistics, including information extraction, machine translation, and syllabification. This paper introduces a simple, proof-of-concept language modeling approach to automatic compound segmentation, demonstrated with Finnish. The approach utilizes an off-the-shelf morphological analyzer to split training words into their constituent morphemes. A language model is subsequently trained on ngrams composed of morphemes, morpheme boundaries, and word boundaries. Finally, linguistic constraints are used to weed out phonotactically ill-formed segmentations, thereby allowing the language model to select the best grammatical segmentation. This approach achieves an accuracy of ~97%.
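The pipeline described in the abstract (analyzer-split training words, an ngram model over morphemes and boundary symbols, then constraint filtering before model selection) can be illustrated with a small sketch. This is not the paper's implementation: the training list is a hand-written stand-in for morphological analyzer output, the `=` boundary token, the add-k smoothed bigram model, and the two phonotactic checks in `well_formed` are all simplifying assumptions chosen for illustration.

```python
import math
from collections import Counter

VOWELS = set("aeiouyäö")

# Hypothetical stand-in for analyzer output: Finnish words already split into parts.
TRAINING = [
    ["kirja", "kauppa"],  # kirjakauppa, "bookshop"
    ["kirja", "hylly"],   # kirjahylly, "bookshelf"
    ["ruoka", "kauppa"],  # ruokakauppa, "grocery store"
    ["kauppa"],
    ["kirja"],
    ["hylly"],
]

def to_tokens(parts):
    """Wrap parts in word-boundary symbols and join them with '=' boundary tokens."""
    tokens = ["<w>"]
    for i, part in enumerate(parts):
        if i > 0:
            tokens.append("=")  # assumed symbol for a compound-internal boundary
        tokens.append(part)
    tokens.append("</w>")
    return tokens

def train_bigrams(data):
    """Count bigrams and their left contexts over the token sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for parts in data:
        toks = to_tokens(parts)
        vocab.update(toks)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(parts, model, k=0.1):
    """Add-k smoothed bigram log-probability of one candidate segmentation."""
    contexts, bigrams, vocab_size = model
    toks = to_tokens(parts)
    return sum(
        math.log((bigrams[(a, b)] + k) / (contexts[a] + k * vocab_size))
        for a, b in zip(toks, toks[1:])
    )

def candidates(word):
    """Yield every way of cutting the word into contiguous, non-empty parts."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in candidates(word[i:]):
            yield [word[:i]] + rest

def well_formed(part):
    """Toy phonotactic filter (an assumption, far cruder than real constraints):
    at least two letters, contains a vowel, no word-initial consonant cluster."""
    if len(part) < 2 or not any(ch in VOWELS for ch in part):
        return False
    return part[0] in VOWELS or part[1] in VOWELS

def segment(word, model):
    """Drop ill-formed candidates, then let the language model pick the best one."""
    valid = [c for c in candidates(word) if all(well_formed(p) for p in c)]
    return max(valid or [[word]], key=lambda c: log_prob(c, model))

MODEL = train_bigrams(TRAINING)
print(segment("kirjakauppa", MODEL))
```

The filter runs before scoring, so the model only ever compares candidates that pass the constraints, mirroring the order of steps in the abstract. Enumerating every split is exponential in word length and is only workable here because the example words are short.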
