International Conference of the Cross-Language Evaluation Forum

A Case Study in Decompounding for Bengali Information Retrieval


Abstract

Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, and Finnish. No previous studies, however, have examined the effect of decompounding in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach to decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. The experiments reported in this paper show that this standard approach does not work particularly well for Bengali IR. Bengali compounds have some distinctive characteristics: i) only one compound constituent may be a valid word, in contrast to the stricter requirement that both be valid; and ii) the first character of the right constituent can be modified by the rules of Sandhi, in contrast to simple concatenation. As a solution, we first propose a more relaxed decompounding, in which a compound word is decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by requiring that constituents co-occur often with the compound word, which indicates how closely related the constituents and the compound are. We perform experiments on the Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) by up to 2.72% and recall by up to 1.8% compared to not decompounding words.
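The two ideas in the abstract, relaxed splitting and co-occurrence-based constituent selection, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `min_len` and `threshold` parameters, and the document-level overlap ratio used as the co-occurrence score are all assumptions made for illustration, since the abstract does not specify them. English placeholder words stand in for Bengali compounds, and Sandhi handling is omitted entirely.

```python
def split_candidates(word, lexicon, min_len=3, relaxed=True):
    """Return (left, right) splits of `word`.

    Strict mode keeps a split only if both parts are in `lexicon`.
    Relaxed mode (an assumption about the paper's idea) also keeps
    splits where just one part is valid; the invalid part is None.
    """
    out = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        l_ok, r_ok = left in lexicon, right in lexicon
        if l_ok and r_ok:
            out.append((left, right))
        elif relaxed and (l_ok or r_ok):
            out.append((left if l_ok else None, right if r_ok else None))
    return out


def cooccurrence(compound, constituent, docs):
    """Fraction of documents containing `compound` that also contain
    `constituent`. `docs` is a list of token sets. This ratio is a
    hypothetical stand-in for the paper's co-occurrence measure."""
    with_compound = [d for d in docs if compound in d]
    if not with_compound:
        return 0.0
    return sum(constituent in d for d in with_compound) / len(with_compound)


def decompound(word, lexicon, docs, threshold=0.5):
    """Return index terms for `word`: the compound itself plus every
    valid constituent whose co-occurrence score reaches `threshold`."""
    kept = set()
    for left, right in split_candidates(word, lexicon):
        for part in (left, right):
            if part is not None and cooccurrence(word, part, docs) >= threshold:
                kept.add(part)
    return [word] + sorted(kept)


# Toy data: English placeholders, not Bengali.
lexicon = {"rain", "coat", "car", "pet"}
docs = [
    {"raincoat", "rain", "wet"},
    {"raincoat", "rain", "coat"},
    {"carpet", "floor"},
    {"carpet", "wool"},
    {"car", "road"},
]

print(decompound("raincoat", lexicon, docs))  # ['raincoat', 'coat', 'rain']
print(decompound("carpet", lexicon, docs))    # ['carpet']
```

In this toy setting, "carpet" splits into two valid words, but neither co-occurs with the compound, so selective decompounding leaves it intact. A purely lexicon-based method would index the unrelated terms "car" and "pet".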
