Mining for evidence in enterprise corpora.

Abstract

The primary research aim of this dissertation is to identify the strategies that best meet the information retrieval needs expressed in the "e-discovery" scenario. This task calls for a high-recall system that, in response to a request for all available documents relevant to a legal complaint, effectively prioritizes documents from an enterprise document collection in order of likelihood of relevance. High-recall information retrieval strategies, such as those employed for e-discovery and patent or medical literature searches, reflect the high cost of missing relevant documents, but they also carry high document review costs.

Our approaches parallel the evaluation opportunities afforded by the TREC Legal Track. Within the ad hoc framework, we propose an approach that includes query field selection, techniques for mitigating OCR error, term weighting strategies, query language reduction, pseudo-relevance feedback using document metadata and terms extracted from documents, merging result sets, and biasing results to favor documents responsive to lawyer-negotiated queries. We conduct several experiments to identify effective parameters for each of these strategies.

Within the relevance feedback framework, we use an active learning approach informed by signals from collected prior relevance judgments and ranking data. We train a classifier to prioritize the unjudged documents retrieved using different ad hoc information retrieval techniques applied to the same topic. We demonstrate significant improvements over heuristic rank aggregation strategies when choosing from a relatively small pool of documents. With a larger pool of documents, we validate the effectiveness of the merging strategy as a means to increase recall, but find that sparseness of judgment data prevents effective ranking by the classifier-based ranker.

We conclude our research by optimizing the classifier-based ranker and applying it to other high-recall datasets. Our concluding experiments consider the potential benefits of modifying the merged runs using methods derived from social choice models. We find that this technique, Local Kemenization, is hampered by the large number of documents and the minimal number of result sets contributing to the ranked list. This two-stage approach to high-recall information retrieval tasks continues to offer a rich set of questions for future research.
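The abstract names Local Kemenization without describing it. As a rough illustration only, the sketch below follows the general insertion-style procedure commonly associated with that name: documents are taken in the order of an initial merged ranking, and each one is moved up past its predecessor while a strict majority of the contributing result sets that rank both documents prefer it. The function names (`majority_prefers`, `local_kemenize`), the handling of documents missing from some result sets, and the toy data are assumptions made for this example; nothing here is taken from the dissertation's own implementation.

```python
from typing import Dict, Hashable, List, Sequence


def majority_prefers(positions: List[Dict[Hashable, int]], a: Hashable, b: Hashable) -> bool:
    """Return True if a strict majority of the rankings that contain both
    a and b place a ahead of b (assumption: other rankings abstain)."""
    votes_a = votes_b = 0
    for pos in positions:
        if a in pos and b in pos:
            if pos[a] < pos[b]:
                votes_a += 1
            else:
                votes_b += 1
    return votes_a > votes_b


def local_kemenize(initial: Sequence[Hashable],
                   rankings: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Reorder `initial` so that no adjacent pair contradicts a strict
    majority of the input rankings."""
    # Map each document to its rank position once per input ranking.
    positions = [{doc: i for i, doc in enumerate(r)} for r in rankings]
    result: List[Hashable] = []
    for doc in initial:
        i = len(result)
        result.append(doc)
        # Bubble the new document upward while a majority prefers it
        # over the document directly above it.
        while i > 0 and majority_prefers(positions, doc, result[i - 1]):
            result[i - 1], result[i] = result[i], result[i - 1]
            i -= 1
    return result


# Hypothetical result sets from three retrieval runs on one topic.
runs = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
merged = ["a", "b", "c"]  # hypothetical initial merged ranking
reordered = local_kemenize(merged, runs)
```

In this sketch, a pair of documents is only swapped when a strict majority exists, so with just two contributing result sets that disagree on a pair, the initial order is left unchanged.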

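Bibliographic record

  • Author: Almquist, Brian Alan.
  • Author affiliation: The University of Iowa.
  • Degree-granting institution: The University of Iowa.
  • Subject: Information technology.
  • Degree: Ph.D.
  • Year: 2011
  • Pagination: 150 p.
  • Total pages: 150
  • Original format: PDF
  • Language of text: English (eng)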
