Unsupervised Learning of Arabic Stemming using a Parallel Corpus

Abstract

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation, and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results will be given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in 87.5% agreement with a state-of-the-art proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.
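The abstract reports two kinds of figures: agreement between two stemmers, and average precision in a retrieval task. The sketch below shows one plausible way such figures could be computed. It is an illustration only: the abstract does not define "agreement", so the per-word exact-match definition here is an assumption, and the function names (`agreement`, `average_precision`, `relative_improvement`) are hypothetical, not taken from the paper.

```python
from typing import Callable, Iterable, Sequence, Set


def agreement(words: Iterable[str],
              stemmer_a: Callable[[str], str],
              stemmer_b: Callable[[str], str]) -> float:
    """Fraction of words for which both stemmers return the same stem.

    Assumption: agreement is measured per word by exact match of the
    two outputs; the paper may define it differently.
    """
    words = list(words)
    if not words:
        return 0.0
    same = sum(1 for w in words if stemmer_a(w) == stemmer_b(w))
    return same / len(words)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked result list."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def relative_improvement(baseline: float, system: float) -> float:
    """Relative gain of `system` over `baseline` (0.25 means +25%)."""
    return (system - baseline) / baseline
```

Under these assumptions, an improvement "of 22-38% in average precision over unstemmed text" would correspond to `relative_improvement` values between 0.22 and 0.38, with the unstemmed run as the baseline.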

Bibliographic details

Source
Annual Meeting of the Association for Computational Linguistics, 2003, 8 pages
Authors
Monica Rogati; Scott McCarley; Yiming Yang
Conference
ACL-03 (Association for Computational Linguistics)
Original format: PDF
Chinese Library Classification: Programming languages, algorithmic languages

Similar literature

1. Effective Unsupervised Arabic Word Stemming: Towards an Unsupervised Radicals Extraction [J]. Ahmed Khorsi. The International Arab Journal of Information Technology, 2012, No. 6.
2. A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics [J]. Singh Jasmeet, Gupta Vishal. Knowledge-Based Systems, 2019, No. SEPa15.
3. Unsupervised Learning of Arabic Stemming using a Parallel Corpus [C]. Monica Rogati, Scott McCarley, Yiming Yang. Annual Meeting of the Association for Computational Linguistics, 2003.
4. A corpus linguistic analysis of English and Arabic parallel business discourse domains [D]. Haichour, El Houcine. 1999.
5. A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking [O]. Nora Madi, Hend S. Al-Khalifa. 2019.
6. Unsupervised Learning of Arabic Stemming using a Parallel Corpus [O]. Monica Rogati. 2003.
