Conference: International conference of the CLEF initiative

A Case Study in Decompounding for Bengali Information Retrieval



Abstract

Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. However, no previous studies exist on the effect of decomposing compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach to decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. The experiments reported in this paper show that this standard approach does not work particularly well for Bengali IR. Bengali compounds have some unique characteristics: (i) only one compound constituent may be a valid word, in contrast to the stricter requirement that both be valid words; and (ii) the first character of the right constituent can be modified by the rules of Sandhi, in contrast to simple concatenation. As a solution, we first propose a more relaxed decompounding, in which a compound word is decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by ensuring that constituents often co-occur with the compound word, which indicates how related the constituents and the compound are. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) by up to 2.72% and recall by up to 1.8%, compared to not decompounding words.
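The two ideas in the abstract, relaxed decompounding and co-occurrence-based constituent selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the co-occurrence measure, so the function names (`relaxed_constituents`, `cooccurrence`, `decompound`), the document-level co-occurrence ratio, the `threshold` value, and the English toy data are all assumptions made for illustration. Sandhi modification of the right constituent is not handled here.

```python
def relaxed_constituents(word, lexicon, min_len=2):
    """Collect every split part that is a valid lexicon word.

    Relaxed rule: a split is accepted if at least ONE side is in the
    lexicon (a strict splitter would require both sides to be valid).
    """
    found = set()
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        for part in (left, right):
            if part in lexicon:
                found.add(part)
    return found


def cooccurrence(compound, constituent, documents):
    """Fraction of documents containing the compound that also contain
    the constituent (an assumed, simplistic relatedness measure)."""
    with_compound = [doc for doc in documents if compound in doc]
    if not with_compound:
        return 0.0
    both = sum(1 for doc in with_compound if constituent in doc)
    return both / len(with_compound)


def decompound(word, lexicon, documents, threshold=0.5):
    """Return the constituents to index in addition to the compound:
    only those that co-occur with it often enough."""
    candidates = relaxed_constituents(word, lexicon)
    return sorted(
        part for part in candidates
        if cooccurrence(word, part, documents) >= threshold
    )


# Toy data (hypothetical): each document is a set of tokens.
LEXICON = {"rain", "bow", "berry"}
DOCS = [
    {"rainbow", "rain", "sky"},
    {"rainbow", "rain"},
    {"rainbow", "bow", "arrow"},
    {"bow", "arrow"},
    {"cranberry", "berry", "juice"},
]

print(decompound("rainbow", LEXICON, DOCS))    # ['rain']
print(decompound("cranberry", LEXICON, DOCS))  # ['berry']
```

In this toy run, "bow" is a valid word but is dropped for "rainbow" because it rarely co-occurs with the compound, while "cranberry" still yields "berry" even though "cran" is not in the lexicon, which a strict both-parts-valid splitter would reject.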
