Restricted inflectional form generation in management of morphological keyword variation

Kimmo Kettunen; Eija Airio; Kalervo Jaervelin

Abstract

Word form normalization through lemmatization or stemming is a standard procedure in information retrieval because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; it also has a query expansion effect. In both approaches, however, the idea is to turn many inflectional word forms into a single lemma or stem, both in the database index and in queries. This means extra effort in creating database indexes. In this paper we take the opposite approach: we leave the database index un-normalized and enrich the queries to cover the surface form variation of keywords. A potential penalty of the approach would be long queries and slow processing. However, we show that it suffices to cover a very small number of the possible surface forms, even in morphologically complex languages, to arrive at a performance that is almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, it only matters to cover nouns and adjectives in queries. Furthermore, we show that our findings are particularly good for short queries that resemble the normal searches of web users. Our approach is called FCG (for Frequent Case (form) Generation). It can be implemented relatively easily for languages written in the Latin, Greek, or Cyrillic alphabets by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3-9 most frequent forms. We demonstrate the potential of our FCG approach for several languages of varying morphological complexity (Swedish, German, Russian, and Finnish) in well-known test collections. Applications include in particular Web IR in languages poor in morphological resources.
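The query-side idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag counts, the suffix table, and the function names are all hypothetical, and the generator is purely concatenative, whereas a real surface form generator would need to handle stem alternations. The sketch ranks case forms by their frequency in a small tagged sample, keeps the top k, and expands each query keyword into those forms so that an un-normalized index can be searched directly.

```python
from collections import Counter

# Hypothetical case/number tags observed for nouns in a small text sample.
# The counts are invented for illustration; they only mimic a skewed distribution.
SAMPLE_TAGS = (
    ["nom_sg"] * 40 + ["gen_sg"] * 25 + ["par_sg"] * 15 + ["ine_sg"] * 8
    + ["ela_sg"] * 5 + ["ill_sg"] * 4 + ["ade_sg"] * 3
)

# Toy suffix table (assumption): plain concatenation that happens to work for
# the Finnish noun "talo" (house). Real generators must model stem changes.
SUFFIXES = {
    "nom_sg": "", "gen_sg": "n", "par_sg": "a", "ine_sg": "ssa",
    "ela_sg": "sta", "ill_sg": "on", "ade_sg": "lla",
}


def most_frequent_forms(tags, k):
    """Return the k most frequent case tags in the sample, most frequent first."""
    return [tag for tag, _ in Counter(tags).most_common(k)]


def expand_keyword(stem, forms):
    """Generate one surface form of the keyword per selected case tag."""
    return [stem + SUFFIXES[form] for form in forms]


def expand_query(keywords, forms):
    """Map each keyword to its generated surface forms (one synonym group each)."""
    return {kw: expand_keyword(kw, forms) for kw in keywords}


if __name__ == "__main__":
    top_forms = most_frequent_forms(SAMPLE_TAGS, 3)
    print(expand_query(["talo"], top_forms))
```

Each group of generated forms would then be treated as a set of alternatives for one keyword when matching against the inflected index; how the groups are combined depends on the retrieval system's query language.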

Bibliographic details

Source: Information Retrieval, 2007, No. 5, pp. 415-444 (30 pages)
Authors: Kimmo Kettunen; Eija Airio; Kalervo Jaervelin
Affiliation: Department of Information Studies, University of Tampere, Tampere 33014, Finland
Original format: PDF
Language of text: English
Chinese Library Classification: Library science, librarianship
Keywords: best-match IR; inflected indexes; frequent case form generation for keywords; generative methods in management of keyword variation
