International World Wide Web Conference; Edinburgh (GB)

Optimizing Scoring Functions and Indexes for Proximity Search in Type-annotated Corpora


Abstract

We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to optimize the space used by the new index structures that our system requires. Extensive performance studies with a 10GB, 2-million document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.
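
The query form described in the abstract (an answer type scored by its nearness to selector tokens) can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's method: the function names, the smoothed IDF weighting, the exponential decay over token gaps, the hand-set `decay` value (which stands in for the trained parameters the abstract mentions), and the toy sentence and document frequencies.

```python
import math

def idf(token, doc_freq, num_docs):
    """Smoothed inverse document frequency: rarer selectors weigh more."""
    return math.log((1 + num_docs) / (1 + doc_freq.get(token, 0)))

def score_candidate(tokens, cand_pos, selectors, doc_freq, num_docs, decay=0.5):
    """Aggregate selector evidence for one candidate position.

    Each selector contributes idf * exp(-decay * gap), where gap is the
    token distance from the candidate to the selector's nearest occurrence.
    The decay shape and the sum are illustrative choices only.
    """
    total = 0.0
    for sel in selectors:
        gaps = [abs(i - cand_pos) for i, t in enumerate(tokens) if t == sel]
        if gaps:
            total += idf(sel, doc_freq, num_docs) * math.exp(-decay * min(gaps))
    return total

def rank_candidates(tokens, types, answer_type, selectors, doc_freq, num_docs, decay=0.5):
    """Return (score, position) pairs for tokens of the answer type, best first."""
    candidates = [i for i, ty in enumerate(types) if ty == answer_type]
    scored = [
        (score_candidate(tokens, i, selectors, doc_freq, num_docs, decay), i)
        for i in candidates
    ]
    return sorted(scored, reverse=True)

# Toy data (made up): a sentence with two tokens annotated as type "distance".
tokens = ["the", "drive", "from", "Hamburg", "to", "Munich", "is", "780", "km",
          "but", "Berlin", "is", "only", "290", "km", "away"]
types = [None] * len(tokens)
types[7] = "distance"   # "780"
types[13] = "distance"  # "290"

# Hypothetical corpus statistics.
doc_freq = {"Hamburg": 50, "Munich": 40}
num_docs = 1000

# Corresponds to the example query: type=distance NEAR Hamburg Munich
ranked = rank_candidates(tokens, types, "distance", ["Hamburg", "Munich"],
                         doc_freq, num_docs)
best_token = tokens[ranked[0][1]]
```

In this toy sentence the candidate at position 7 sits two and four tokens from the selectors, while the one at position 13 sits eight and ten tokens away, so the closer candidate ranks first. In a system like the one described, the decay shape and weights would be estimated from logged queries and answer spans instead of being fixed by hand.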
