Journal: The Journal of Documentation

To stem or lemmatize a highly inflectional language in a probabilistic IR environment?

Abstract

Purpose - To show that stem generation compares well with lemmatization as a morphological tool for a highly inflectional language for IR purposes in a best-match retrieval system.

Design/methodology/approach - The effects of three morphological methods (lemmatization, stemming and stem production) for Finnish are compared in a probabilistic IR environment (INQUERY). Evaluation uses a four-point relevance scale, which is partitioned differently in different test settings.

Findings - Results show that stem production, a lighter method than morphological lemmatization, compares well with lemmatization in a best-match IR environment. Differences in performance between stem production and lemmatization are small and are not statistically significant in most of the tested settings. It is also shown that stemming, a hitherto rather neglected method of morphological processing for Finnish, performs reasonably well, although the stemmer used - a Porter stemmer implementation - is far from optimal for a morphologically complex language like Finnish. In another series of tests, the effects of compound splitting and derivational expansion of queries are tested.

Practical implications - The usefulness of morphological lemmatization and stem generation for IR purposes can be estimated with many factors. On the average P-R level they seem to behave very close to each other in a probabilistic IR system. Thus, the choice of method for highly inflectional languages needs to be estimated along other dimensions too.

Originality/value - Results are achieved using Finnish as an example of a highly inflectional language. The results are of interest to anyone interested in processing the morphological variation of a highly inflected language for IR purposes.
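The abstract states that the four-point relevance scale is partitioned differently across test settings, but it does not give the grade labels or the cut-offs used. As a rough illustration only, the sketch below assumes grades 0 to 3, turns them into binary relevance at a chosen threshold, and scores one ranked list with non-interpolated average precision. The metric, the helper names `binarize` and `average_precision`, and the judgments in `ranked_grades` are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def binarize(grades: Sequence[int], threshold: int) -> list[bool]:
    """Map graded judgments (assumed 0..3) to binary relevance: grade >= threshold."""
    return [g >= threshold for g in grades]


def average_precision(ranked_relevance: Sequence[bool], total_relevant: int) -> float:
    """Non-interpolated average precision for a single ranked result list."""
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / total_relevant


# Hypothetical graded judgments for the top six documents of one query.
ranked_grades = [3, 0, 2, 1, 0, 3]

# Three possible partitions of the scale: liberal (>= 1), regular (>= 2), strict (== 3).
for threshold in (1, 2, 3):
    relevant = binarize(ranked_grades, threshold)
    # Simplifying assumption: every relevant document appears in this list.
    ap = average_precision(relevant, total_relevant=sum(relevant))
    print(f"threshold >= {threshold}: AP = {ap:.3f}")
```

Running the same ranked list through several thresholds shows how a stricter partition can change the measured score without any change to the underlying ranking. This is why comparisons between morphological methods are reported per test setting.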
