Annual Meeting of the Association for Computational Linguistics

Unsupervised Learning of Arabic Stemming using a Parallel Corpus

Abstract

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results will be given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in 87.5% agreement with a state-of-the-art proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.
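
The abstract does not spell out the form of the stemming model, so the following is only an illustrative sketch of what affix removal by statistical scoring could look like, not the authors' implementation. It assumes a word is split into prefix, stem, and suffix, and that each split is scored as a product of independent probabilities. The probability tables, the transliterated toy strings, the `UNSEEN_STEM_P` floor, and the function names are all hypothetical and made up for the example; in a real system such tables would be estimated from training data.

```python
# Hypothetical toy probability tables (NOT taken from the paper).
# Strings are rough Latin transliterations used purely for illustration.
PREFIX_P = {"": 0.5, "al": 0.3, "w": 0.1, "wal": 0.1}
SUFFIX_P = {"": 0.6, "at": 0.2, "wn": 0.2}
STEM_P = {"ktab": 0.4, "mdrs": 0.3, "alktab": 0.01}
UNSEEN_STEM_P = 1e-6  # assumed floor for stems missing from STEM_P


def candidate_splits(word, max_prefix=3, max_suffix=2):
    """Yield every (prefix, stem, suffix) split that leaves a non-empty stem."""
    for p_len in range(0, min(max_prefix, len(word) - 1) + 1):
        for s_len in range(0, min(max_suffix, len(word) - p_len - 1) + 1):
            end = len(word) - s_len
            yield word[:p_len], word[p_len:end], word[end:]


def score(prefix, stem, suffix):
    """Score a split as a product of independent affix and stem probabilities."""
    return (
        PREFIX_P.get(prefix, 0.0)
        * STEM_P.get(stem, UNSEEN_STEM_P)
        * SUFFIX_P.get(suffix, 0.0)
    )


def best_stem(word):
    """Return the stem of the highest-scoring split of `word`."""
    best = max(candidate_splits(word), key=lambda split: score(*split))
    return best[1]


if __name__ == "__main__":
    for w in ["alktab", "walktabat", "mdrs"]:
        print(w, "->", best_stem(w))
```

The empty prefix and suffix are always candidates, so an unknown word falls back to itself rather than being stripped arbitrarily.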