Conference: IEEE International Congress on Big Data

Query revision during cluster based search on large unstructured corpora


Abstract

We investigate a frequently occurring issue in search (retrieval) in the age of big unstructured data. Searches conducted on large unstructured corpora produce long result lists, which are often clustered and reranked for ease of navigation. Should a query be revised during time-critical examination of such long, cluster-based reranked lists? This question arises naturally during the early stages of commercially important applications of IR such as eDiscovery, but has not yet received any research attention. Four factors compound the difficulty of this question in the setting of eDiscovery: (a) the query sources (the technical experts) are different from the legal staff who actually execute the query and use the retrieval system, (b) the retrieved lists for each query tend to be very long, (c) the user might be accessing the retrieved results through a clustering interface, and (d) all decisions must be transparent and easy to explain due to the litigious nature of the application. Analogous difficulties arise in other applications involving search over large unstructured corpora. We introduce a framework to help users make the decision of “whether to revise.” Our framework consists of two components. First, we introduce a “limited view,” a summary of a long cluster-based reranked list, which serves as the first input to the user. Second, we construct query performance predictors for this limited view and provide their prediction as a second input to the user. This prediction is used to corroborate the user's inspection of the limited view. The proposed combination of a limited view and query performance prediction can assist search staff in determining whether to pursue an expensive query revision, and can save precious time by precluding inspection of lists with very few relevant documents during the early stages of commercially important applications such as eDiscovery.
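The two inputs described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the limited view is built or which predictors are used, so everything below is an assumption made for illustration: the function names, the rule of taking the top few documents from the top few clusters, the score-dispersion heuristic used as a stand-in predictor, and the threshold value are all hypothetical and are not the authors' method.

```python
from statistics import mean, pstdev

def limited_view(clusters, top_clusters=3, docs_per_cluster=2):
    """Summarize a cluster-based reranked list (hypothetical rule).

    clusters: clusters in reranked order; each cluster is a list of
    (doc_id, retrieval_score) pairs in ranked order.
    Returns the first docs_per_cluster documents of the first
    top_clusters clusters.
    """
    return [doc
            for cluster in clusters[:top_clusters]
            for doc in cluster[:docs_per_cluster]]

def score_spread_predictor(view):
    """Stand-in predictor (assumption): coefficient of variation of the
    retrieval scores in the view. A flat score profile is treated as a
    weak signal of a poorly performing query."""
    scores = [score for _, score in view]
    if len(scores) < 2:
        return 0.0
    mu = mean(scores)
    return pstdev(scores) / abs(mu) if mu else 0.0

def should_revise(view, threshold=0.2):
    """Suggest revision when the predictor falls below an illustrative,
    untuned threshold. The user would still inspect the view itself."""
    return score_spread_predictor(view) < threshold

# Toy reranked list: four clusters of (doc_id, score) pairs.
clusters = [
    [("d1", 9.0), ("d2", 7.0), ("d3", 1.0)],
    [("d4", 5.0)],
    [("d5", 3.0), ("d6", 2.0)],
    [("d7", 1.0)],
]
view = limited_view(clusters)
print([doc_id for doc_id, _ in view])
print(should_revise(view))
```

In this sketch the user sees the short `view` first and the predictor's suggestion second, mirroring the two-input arrangement in the abstract. Both the summary rule and the predictor are simple enough to explain to a third party, which matters given the transparency requirement the abstract raises.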
