Journal: Turkish Journal of Electrical Engineering and Computer Sciences

Turkish lexicon expansion by using finite state automata


Abstract

Turkish is an agglutinative language with rich morphology: a single Turkish verb can have thousands of different word forms. Sparsity is therefore an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aim to expand the lexicon with a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extract orthographic features by capturing the phonological operations that apply to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered by part of speech (i.e., noun, verb, or adjective), and suffixes are clustered by their allomorphic features. Using only a few thousand Turkish stems, we generated approximately 1 million word forms with an accuracy of 82.36%, which will help reduce the number of out-of-vocabulary words in other NLP applications. Although our experiments were performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
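To make the generation idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy automaton with one stem state and two suffix states (plural and locative), and it models the phonological operations with two well-known Turkish rules: vowel harmony (-lar/-ler, -da/-de) and consonant assimilation after a voiceless consonant (-ta/-te). All names (`TRANSITIONS`, `generate`, the state labels) are hypothetical and chosen only for illustration.

```python
# Toy FSA for Turkish noun generation (illustrative assumption, not the paper's model).
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def last_vowel(word):
    """Return the last vowel of the word, which drives vowel harmony."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel found in {word!r}")


def plural(word):
    """Attach the plural suffix, choosing the allomorph -lar or -ler."""
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "ler")


def locative(word):
    """Attach the locative suffix, choosing among -da, -de, -ta, -te."""
    consonant = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel(word) in BACK_VOWELS else "e"
    return word + consonant + vowel


# Morphotactics: each state lists (suffix function, next state) transitions.
TRANSITIONS = {
    "NOUN_STEM": [(plural, "PLURAL"), (locative, "CASE")],
    "PLURAL": [(locative, "CASE")],
    "CASE": [],
}


def generate(word, state="NOUN_STEM"):
    """Walk every path from the given state and collect the word forms produced."""
    forms = []
    for attach, next_state in TRANSITIONS[state]:
        form = attach(word)
        forms.append(form)
        forms.extend(generate(form, next_state))
    return forms


if __name__ == "__main__":
    for stem in ["ev", "kitap", "okul"]:
        print(stem, "->", generate(stem))
```

In this sketch, each stem yields three forms (for example, "ev" gives "evler", "evlerde", and "evde"). A realistic system would need many more suffix categories and stem-internal alternations, which this toy example deliberately ignores.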