Query Expansion for Transliterated Text Retrieval

Abstract

With Web 2.0, there has been exponential growth in the number of Web users and the volume of Web content. Most of these users are not only consumers of information but also generators of it. People express themselves on the Web in colloquial languages, but write them in Roman script (transliteration). These texts are mostly informal and casual, and therefore seldom follow grammar rules. Moreover, no prescribed set of spelling rules exists for transliterated text. This freedom leads to large-scale spelling variation, which is a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach, in two flavors, designed with Hindi transliteration in mind. Experiments on Hindi song lyrics retrieval in the mixed-script domain, using three different retrieval models, show that the proposed approaches outperform the existing techniques in a majority of cases (sometimes with statistical significance) on a number of metrics: nDCG@1, nDCG@5, nDCG@10, MAP, MRR, and Recall.
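To make the idea of phonetic query expansion concrete, here is a minimal sketch that uses classic American Soundex, one well-known existing phonetic algorithm. It is not the encoding proposed in the article, and the abstract does not say which existing algorithms were studied. The function names (`soundex`, `build_index`, `expand_query`) and the small vocabulary of romanized Hindi spellings are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Standard Soundex consonant classes; vowels and y carry no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex key of an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeat after it is kept
    return (key + "000")[:4]


def build_index(vocabulary):
    """Group vocabulary words by phonetic key (illustrative helper)."""
    index = defaultdict(set)
    for word in vocabulary:
        index[soundex(word)].add(word)
    return index


def expand_query(query, index):
    """Replace each query term by all vocabulary words sharing its key."""
    expanded = []
    for term in query.split():
        variants = index.get(soundex(term), set()) | {term}
        expanded.append(sorted(variants))
    return expanded


# Hypothetical vocabulary of romanized spellings, for illustration only.
vocab = ["zindagi", "zindagee", "zindgi", "jindagi", "pyar", "pyaar", "piyar"]
index = build_index(vocab)
print(expand_query("zindagi pyar", index))
```

In this toy vocabulary, the variants `zindagee` and `zindgi` share a key with `zindagi`, while `jindagi` does not, because Soundex always keeps the first letter verbatim. That is one example of how a general-purpose phonetic code can miss transliteration variants.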
