Journal: The Arabian Journal for Science and Engineering

MODELING ARABIC LANGUAGE USING STATISTICAL METHODS


Abstract

In this paper, we investigate statistical language models for Arabic. First, several experiments with different smoothing techniques are carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to investigate other solutions that do not require increasing the size of the corpus. A word segmentation technique is employed to increase the statistical viability of the corpus, and an n-morpheme model is developed that performs better in terms of normalized perplexity. The second experiment studies the behavior of statistical models built on different kinds of corpora; introducing a distant n-gram improves the baseline model. Finally, we present a comparative study of statistical language models for Arabic and several foreign languages. The objective of this study is to understand how best to model each of these languages. For the foreign languages, trigram models are the most appropriate whatever the smoothing technique used. For Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more efficient.
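
To make the terms in the abstract concrete, the following is a minimal sketch of a bigram model with interpolated Witten-Bell smoothing, plus an ordinary per-token perplexity. It is not the authors' implementation: the class name `WittenBellBigram`, the `<unk>` handling, the sentence markers, and the toy corpus are all assumptions made for illustration. The abstract does not define its "normalized perplexity", so that measure is not reproduced here.

```python
import math
from collections import Counter, defaultdict

UNK = "<unk>"  # assumed placeholder for out-of-vocabulary tokens


class WittenBellBigram:
    """Illustrative bigram model with interpolated Witten-Bell smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()            # c(w) for predicted tokens
        self.bigrams = Counter()             # c(h, w)
        self.history_counts = Counter()      # c(h)
        self.followers = defaultdict(set)    # distinct types seen after h
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            self.unigrams.update(tokens[1:])  # <s> is never predicted
            for h, w in zip(tokens, tokens[1:]):
                self.bigrams[(h, w)] += 1
                self.history_counts[h] += 1
                self.followers[h].add(w)
        self.vocab = set(self.unigrams) | {UNK}
        self.total = sum(self.unigrams.values())

    def _norm(self, w):
        return w if w in self.vocab else UNK

    def p_unigram(self, w):
        # Witten-Bell interpolation of unigram counts with a uniform
        # distribution over the vocabulary; t = number of seen types.
        w = self._norm(w)
        t = len(self.unigrams)
        return (self.unigrams[w] + t / len(self.vocab)) / (self.total + t)

    def p_bigram(self, w, h):
        # P(w | h) = (c(h, w) + T(h) * P(w)) / (c(h) + T(h)),
        # where T(h) is the number of distinct types that followed h.
        w = self._norm(w)
        c_h = self.history_counts.get(h, 0)
        if c_h == 0:  # unseen history: back off entirely to the unigram
            return self.p_unigram(w)
        t_h = len(self.followers[h])
        return (self.bigrams[(h, w)] + t_h * self.p_unigram(w)) / (c_h + t_h)

    def perplexity(self, sentences):
        # Per-token perplexity: 2 ** (-average log2 probability).
        log_prob, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            for h, w in zip(tokens, tokens[1:]):
                log_prob += math.log2(self.p_bigram(w, h))
                n += 1
        return 2 ** (-log_prob / n)


# Toy usage with made-up tokens; a real experiment would use a corpus.
train = [["a", "b", "c"], ["a", "b", "d"]]
model = WittenBellBigram(train)
print(model.perplexity([["a", "b", "z"]]))
```

The same class would work on morpheme sequences instead of word sequences if each sentence were segmented before training, which is the general idea behind an n-morpheme model; the segmentation step itself is outside this sketch.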
