Conference: 7th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2003), vol. 5: Computer Science and Engineering: I
Natural Language Processing with Few Computational Linguistic Resources: An Experiment with Automatic Sentence Parsing for Amharic Texts

Abstract

Developing all aspects of natural language processing for a new language from scratch requires a huge amount of work. At the same time, there is an urgent need for a variety of applications, including local-language spell-checkers, word processors, machine translation systems, and search engines. Developing these applications requires computerized language resources and a well-developed framework for research in this area. Treebanks, part-of-speech taggers, computerized grammars, lexica, and parsers are all necessary parts of this framework. The study reported in this article describes an attempt to design and implement a prototype of an automatic sentence parser for Amharic text. Amharic is the official government language of Ethiopia and a language for which very few computational linguistic resources exist. To parse sentences automatically, the study used the Inside-Outside algorithm with a bottom-up chart parsing strategy. A probabilistic context-free grammar was used as the grammatical formalism to represent the phrase structure rules of the language. A small sample corpus of 100 four-word sentences was selected from sentences in the language and used as both the training set and the test set. In spite of the limited amount of data and other resources available, the experiments show some promising results.
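To make the parsing approach named in the abstract more concrete, the following is a minimal sketch of the inside pass of the Inside-Outside algorithm, computed bottom-up over a chart for a probabilistic context-free grammar in Chomsky normal form. It is an illustration only, not the authors' implementation: the grammar, the category names, the verb-final rule, and the placeholder tokens `a`, `n`, `m`, `v` are all hypothetical, and the outside pass and the rule re-estimation step are omitted.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (NOT the paper's grammar).
# Binary rules: (parent, left_child, right_child) -> probability
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "ADJ", "N"): 0.5,
    ("VP", "NP", "V"): 1.0,   # assumed verb-final order, for illustration
}
# Lexical rules: (category, word) -> probability; words are placeholders
LEXICAL = {
    ("NP", "m"): 0.5,
    ("ADJ", "a"): 1.0,
    ("N", "n"): 1.0,
    ("V", "v"): 1.0,
}


def inside_probabilities(words, lexical, binary):
    """Fill a chart bottom-up with inside probabilities.

    beta[(i, j)][A] is the probability that category A derives words[i:j].
    """
    n = len(words)
    beta = defaultdict(lambda: defaultdict(float))

    # Width-1 spans: apply the lexical rules.
    for i, w in enumerate(words):
        for (cat, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1)][cat] += p

    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for (parent, left, right), p in binary.items():
                    p_left = beta[(i, k)].get(left, 0.0)
                    p_right = beta[(k, j)].get(right, 0.0)
                    if p_left and p_right:
                        beta[(i, j)][parent] += p * p_left * p_right
    return beta


def sentence_probability(words, lexical=LEXICAL, binary=BINARY, start="S"):
    """Inside probability of the start symbol over the whole sentence."""
    beta = inside_probabilities(words, lexical, binary)
    return beta[(0, len(words))].get(start, 0.0)


if __name__ == "__main__":
    # A four-token placeholder sentence, mirroring the four-word corpus size.
    print(sentence_probability(["a", "n", "m", "v"]))
```

In a full Inside-Outside training loop, these inside values would be combined with outside probabilities to compute expected rule counts, which are then normalized to re-estimate the rule probabilities.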