MODELING ARABIC LANGUAGE USING STATISTICAL METHODS

Karima Meftouh; M. Tayeb Laskri; Kamel Smaili

Abstract

In this paper, we propose to investigate statistical language models for Arabic. First, several experiments using different smoothing techniques are carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to investigate other solutions without increasing the size of the corpus. A word segmentation technique has been employed in order to increase the statistical viability of the corpus. An n-morpheme model has been developed which leads to a better performance in terms of normalized perplexity. The second experiment concerns the study of the behavior of statistical models based on different kinds of corpora. The introduction of a distant n-gram improves the baseline model. Finally, we propose a comparative study of statistical language models for Arabic and several foreign languages. The objective of this study is to understand how to better model each of these languages. For foreign languages, trigram models are most appropriate whatever the smoothing technique used. For Arabic, the n-gram models of higher order smoothed with the Witten-Bell method are more efficient.
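The abstract names Witten-Bell smoothing and perplexity without defining them. As a rough illustration only, the sketch below trains a bigram model with interpolated Witten-Bell smoothing on a toy corpus and computes its (unnormalized) perplexity. It is not the authors' implementation: the toy tokens, the function names, the choice of bigram order, and the uniform fallback over a fixed vocabulary are all assumptions made for this example, and the normalized perplexity used for the n-morpheme model is not covered.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Collect unigram counts and, per history word, counts of following words."""
    unigrams = Counter()
    followers = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[1:])  # only tokens that are actually predicted
        for prev, cur in zip(tokens, tokens[1:]):
            followers[prev][cur] += 1
    return unigrams, followers

def witten_bell_prob(word, prev, unigrams, followers, vocab_size):
    """Interpolated Witten-Bell estimate of P(word | prev).

    Assumption: the unigram level backs off to a uniform distribution
    over a closed vocabulary of vocab_size entries.
    """
    n = sum(unigrams.values())   # number of predicted tokens
    t = len(unigrams)            # number of distinct seen types
    p_uni = (unigrams[word] + t / vocab_size) / (n + t)
    hist = followers.get(prev)
    if not hist:                 # unseen history: fall back to the unigram level
        return p_uni
    c_h = sum(hist.values())     # how often the history was seen
    t_h = len(hist)              # distinct word types seen after the history
    return (hist[word] + t_h * p_uni) / (c_h + t_h)

def perplexity(sentences, unigrams, followers, vocab_size):
    """Perplexity = 2 ** (average negative log2 probability per token)."""
    log_sum, count = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = witten_bell_prob(cur, prev, unigrams, followers, vocab_size)
            log_sum += math.log2(p)
            count += 1
    return 2 ** (-log_sum / count)

# Hypothetical toy data; the tokens could equally be morphemes after segmentation.
train = [["a", "b", "c"], ["a", "b", "d"]]
VOCAB = ["a", "b", "c", "d", "</s>", "<unk>"]
unigrams, followers = train_bigram(train)
ppl = perplexity([["a", "b", "<unk>"]], unigrams, followers, len(VOCAB))
```

Because the smoothed estimate reserves mass for unseen events in proportion to the number of distinct continuations of each history, the unseen token `<unk>` still receives a nonzero probability and the perplexity stays finite.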

Bibliographic details

Source
The Arabian Journal for Science and Engineering | 2010, Issue 2C | pp. 69-82 | 14 pages
Authors
Karima Meftouh; M. Tayeb Laskri; Kamel Smaili
Author affiliations

Badji Mokhtar University, Computer Science Department, BP 12 23000 Annaba, Algeria;

Badji Mokhtar University, Computer Science Department, BP 12 23000 Annaba, Algeria;

INRIA-LORIA, Parole Team, BP 101 54602 Villers-lès-Nancy, France;

Indexed in: Science Citation Index (SCI), USA
Original format: PDF
Text language: English
Chinese Library Classification: not listed
Keywords
language model; morphemes; perplexity; segmentation; smoothing techniques; corpora;

Similar literature

1. Topic Identification by Statistical Methods for Arabic Language [J]. Mourad Abbas, Daoud Berkani. WSEAS Transactions on Computers, 2006, No. 9.
2. A comparison of segmentation methods and extended lexicon models for Arabic statistical machine translation [J]. Sasa Hasan, Saab Mansour, Hermann Ney. Machine Translation, 2012, No. 1a2.
3. Joint Morphological-Lexical Language Modeling for Processing Morphologically Rich Languages With Application to Dialectal Arabic [J]. Sarikaya R., Afify M., Deng Y. IEEE Transactions on Audio, Speech and Language Processing, 2008, No. 7.
4. A new method to construct a statistical model for Arabic language [C]. Sadiqui Ali, Zinedine Ahmed. 3rd International IEEE Colloquium on Information Science and Technology, 2014.
5. Attitudes of teachers of Arabic as a foreign language toward methods of foreign language teaching [D]. Seraj, Sami A. 2010.
6. Huffman and Linear Scanning Methods with Statistical Language Models [O]. Brian Roark, Melanie Fried-Oken, Chris Gibbons.
7. Arabic text recognition of printed manuscripts. Efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing [O]. Al-Muhtaseb Husni Abdulghani. 2010.
