Journal: Information Retrieval

Restricted inflectional form generation in management of morphological keyword variation



Abstract

Word form normalization through lemmatization or stemming is a standard procedure in information retrieval, because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; it also has a query expansion effect. In both approaches, however, the idea is to reduce many inflectional word forms to a single lemma or stem, both in the database index and in queries. This means extra effort in creating database indexes. In this paper we take the opposite approach: we leave the database index un-normalized and enrich the queries to cover the surface form variation of keywords. A potential penalty of the approach is long queries and slow processing. However, we show that, even in morphologically complex languages, it suffices to cover only a very small number of the possible surface forms to arrive at a performance that is almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, it suffices to cover only nouns and adjectives in queries. Furthermore, we show that our findings hold particularly well for short queries that resemble the normal searches of web users. Our approach is called FCG (Frequent Case (form) Generation). It can be implemented relatively easily for languages written in the Latin, Greek, or Cyrillic alphabets by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3-9 most frequent forms. We demonstrate the potential of the FCG approach for several languages of varying morphological complexity (Swedish, German, Russian, and Finnish) in well-known test collections. Applications include, in particular, Web IR in languages poor in morphological resources.
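The general idea described in the abstract (rank nominal forms by frequency in a small sample, keep only the few most frequent ones, and generate those forms for each query keyword) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The tagged sample, the case labels, the suffix table, and the function names `most_frequent_cases`, `generate_forms`, and `expand_query` are all assumptions made for illustration, and the toy generator simply concatenates suffixes, ignoring stem alternation, vowel harmony, and plural forms that a real generator for a language such as Finnish would have to handle.

```python
from collections import Counter

# Hypothetical tagged sample: (surface form, case label) pairs for nominals.
# A real sample would come from a small analyzed text in the target language.
SAMPLE = [
    ("talo", "NOM"), ("auto", "NOM"), ("kala", "NOM"), ("katu", "NOM"),
    ("talon", "GEN"), ("auton", "GEN"), ("kalan", "GEN"),
    ("taloa", "PTV"), ("kalaa", "PTV"),
    ("talossa", "INE"),
]

# Toy suffix table (an assumption): plain concatenation only, with no
# stem alternation, vowel harmony, or plural handling.
SUFFIXES = {"NOM": "", "GEN": "n", "PTV": "a", "INE": "ssa"}


def most_frequent_cases(tagged_sample, k):
    """Return the k case labels that occur most often in the sample."""
    counts = Counter(label for _, label in tagged_sample)
    return [label for label, _ in counts.most_common(k)]


def generate_forms(base, cases, suffixes):
    """Generate one surface form of `base` per selected case."""
    return [base + suffixes[case] for case in cases]


def expand_query(keywords, cases, suffixes):
    """Map each keyword to the list of generated surface forms."""
    return {word: generate_forms(word, cases, suffixes) for word in keywords}


top_cases = most_frequent_cases(SAMPLE, k=3)
expanded = expand_query(["talo", "kala"], top_cases, SUFFIXES)
print(top_cases)
print(expanded)
```

In a retrieval setting, the generated variants of each keyword would presumably be combined as alternatives (for example, as a disjunction or synonym group) and matched against the un-normalized index; how this is expressed depends on the query language of the search engine in use.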
