Published in: International Conference on Applications of Natural Language to Information Systems

Division of Spanish Words into Morphemes with a Genetic Algorithm



Abstract

We discuss an unsupervised technique for determining the morpheme structure of words in an inflectional language, with Spanish as a case study. For this, we use global optimization (implemented with a genetic algorithm), whereas most previous work relies on heuristics computed from conditional probabilities of word parts. Thus, we search the complete solution space rather than reducing it in advance, which risks eliminating some correct solutions. We also work at the derivational level, in contrast to the more traditional grammatical level, which is concerned only with inflections. The algorithm works as follows. The input is a wordlist built from a large dictionary or corpus in the given language, and the output is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that might possibly be prefixes, suffixes, and stems of the words in the wordlist. Then, we detect possible paradigms in this set and filter out all items from the lists of possible prefixes and suffixes (though not stems) that do not participate in such paradigms. Finally, a subset of those lists of possible prefixes, stems, and suffixes is chosen using the genetic algorithm. The fitness function is based on the idea of minimum description length, i.e., we choose the minimum number of elements necessary to cover all the words. The resulting subset is used to divide the words in the wordlist. Algorithm parameters are presented. A preliminary evaluation of the experimental results for a dictionary of Spanish is given.
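
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how paradigms are detected or how the genetic algorithm is configured, so everything concrete here is an assumption made for illustration. In particular, the paradigm test (an affix is kept if its remainder occurs with at least two different affixes), the treatment of empty affixes as free, the penalty weight, and all function names and parameter values (`paradigm_affixes`, `build_candidates`, `segment`, `evolve`, `pop_size`, `p_mut`, and so on) are hypothetical.

```python
import random

def splits(word):
    """All (prefix, stem, suffix) splits of a word with a non-empty stem."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(n) for j in range(i + 1, n + 1)]

def paradigm_affixes(words, suffix=True):
    """Assumed paradigm test: keep affixes whose remainder takes >= 2 affixes."""
    by_rest = {}
    for w in words:
        for k in range(1, len(w) + 1):  # k = length of the remainder
            rest, affix = (w[:k], w[k:]) if suffix else (w[-k:], w[:-k])
            by_rest.setdefault(rest, set()).add(affix)
    return {a for affs in by_rest.values() if len(affs) >= 2 for a in affs if a}

def build_candidates(words):
    """Redundant candidate list; affixes are filtered, stems are not."""
    prefixes = paradigm_affixes(words, suffix=False)
    suffixes = paradigm_affixes(words, suffix=True)
    stems = {s for w in words for _, s, _ in splits(w)}
    return ([("pre", p) for p in sorted(prefixes)] +
            [("stem", s) for s in sorted(stems)] +
            [("suf", x) for x in sorted(suffixes)])

def segment(word, chosen):
    """First split whose parts are all chosen (empty affixes are free)."""
    for p, s, x in splits(word):
        if ((not p or ("pre", p) in chosen) and ("stem", s) in chosen
                and (not x or ("suf", x) in chosen)):
            return p, s, x
    return None

def evolve(words, cands, pop_size=40, generations=150, p_mut=0.02, seed=0):
    """Toy genetic algorithm: minimise the number of chosen elements
    while still covering every word (illustrative parameters only)."""
    rng = random.Random(seed)
    n = len(cands)

    def cost(bits):
        chosen = {c for c, b in zip(cands, bits) if b}
        uncovered = sum(segment(w, chosen) is None for w in words)
        # One uncovered word costs more than selecting every candidate.
        return sum(bits) + (n + 1) * uncovered

    # Random subsets plus the full set, which always covers every word.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]      # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)      # one-point crossover
            children.append([bit ^ (rng.random() < p_mut)
                             for bit in a[:cut] + b[cut:]])
        pop = parents + children
    best = min(pop, key=cost)
    return {c for c, b in zip(cands, best) if b}

if __name__ == "__main__":
    wordlist = ["casa", "casas", "gato", "gatos", "gata", "gatas"]
    candidates = build_candidates(wordlist)
    selected = evolve(wordlist, candidates)
    for w in wordlist:
        print(w, "->", segment(w, selected))
```

Because the full candidate set is placed in the initial population and the best half of each generation is carried over unchanged, the returned subset always covers the whole wordlist. How linguistically plausible the resulting splits are depends on the wordlist and on the assumed fitness and paradigm test, so the output on a toy list should not be read as evidence about the paper's results.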
