Conference: IEEE International Congress on Big Data

Batch-mode active learning for technology-assisted review

Abstract

In recent years, technology-assisted review (TAR) has become an increasingly important component of the document review process in litigation discovery. This is fueled largely by the dramatic growth in data volumes associated with many matters and investigations. Potential review populations frequently exceed several hundred thousand documents, and document counts in the millions are not uncommon. Budgetary and/or time constraints often make a once-traditional linear review of these populations impractical, if not impossible, which has made "predictive coding" the most discussed TAR approach in recent years. A key challenge in any predictive coding approach is striking the appropriate balance in training the system. The goal is to minimize the time that Subject Matter Experts spend training the system, while making sure that they perform enough training to achieve acceptable classification performance over the entire review population. Recent research demonstrates that Support Vector Machines (SVM) perform very well in finding a compact, yet effective, training dataset in an iterative fashion using batch-mode active learning. However, this research is limited. Additionally, these efforts have not led to a principled approach for determining the stabilization of the active learning process. In this paper, we propose and compare several batch-mode active learning methods that are integrated within the SVM learning algorithm. We also propose methods for determining the stabilization of the active learning process. Experimental results on a set of large-scale, real-life legal document collections validate the superiority of our method over the existing methods for this task.
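The abstract does not spell out the selection or stopping methods, so the following is only a minimal sketch of the general idea of batch-mode active learning with a linear SVM, not the authors' method. Everything in it is an assumption made for illustration: the toy 2-D data standing in for document vectors, the simple subgradient trainer, the selection rule (take the batch of unlabeled items closest to the hyperplane), and the stabilization rule (stop once fewer than `tol` of all predictions change between rounds). The names `make_toy_data`, `train_linear_svm`, and `active_learn` are hypothetical.

```python
import random


def make_toy_data(n=400, seed=0):
    """Two well-separated 2-D Gaussian blobs standing in for document vectors."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1  # alternate so any prefix holds both classes
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(3.0 * label, 1.0)])
        y.append(label)
    return X, y


def score(w, b, x):
    """Signed SVM decision value; its magnitude grows with distance from the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=50, lr=0.05, lam=0.01, seed=0):
    """Linear SVM fitted by stochastic subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * score(w, b, X[i]) < 1.0
            w = [wj * (1.0 - lr * lam) for wj in w]  # L2 shrinkage step
            if violated:  # hinge loss is active for this example
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def active_learn(X, oracle, seed_size=10, batch_size=10, tol=0.01, max_rounds=20):
    """Iteratively label batches; `oracle` simulates the Subject Matter Expert."""
    labeled = list(range(seed_size))  # assumed seed set: the first few items
    pool = list(range(seed_size, len(X)))
    prev = None
    for rounds in range(1, max_rounds + 1):
        w, b = train_linear_svm([X[i] for i in labeled], [oracle[i] for i in labeled])
        preds = [1 if score(w, b, x) >= 0 else -1 for x in X]
        if prev is not None:
            # Assumed stabilization rule: stop when few predictions change.
            changed = sum(p != q for p, q in zip(preds, prev)) / len(X)
            if changed < tol:
                break
        prev = preds
        if not pool:
            break
        # Assumed selection rule: the batch of items nearest the hyperplane.
        pool.sort(key=lambda i: abs(score(w, b, X[i])))
        labeled.extend(pool[:batch_size])
        pool = pool[batch_size:]
    return w, b, labeled, rounds


if __name__ == "__main__":
    X, y = make_toy_data()
    w, b, labeled, rounds = active_learn(X, y)
    acc = sum((1 if score(w, b, x) >= 0 else -1) == t for x, t in zip(X, y)) / len(X)
    print(f"rounds={rounds} labeled={len(labeled)} accuracy={acc:.3f}")
```

Picking the items nearest the hyperplane is a common uncertainty heuristic, but it can fill a batch with near-duplicate documents; batch-mode methods typically add some diversity criterion, which this sketch omits.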
