
Mining needle in a haystack


Abstract

Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, and deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it separately conquers the objectives of achieving high recall and high precision for the given target class. The first phase of the method aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve precision by learning rules that remove false positives from the collection of records covered by the first-phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. In this paper, we claim that such an approach is inadequate for rare classes because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we design various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5 rules, either fail to learn a model or learn a very poor model. In all these situations, our two-phase approach learns a model with significantly better recall and precision levels. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is significantly better than or comparable to the best competitor in terms of achieving a better balance between recall and precision.
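To make the division of labor between the two phases concrete, the following is a minimal sketch of how a two-phase rule set could be applied and scored. It is not the authors' induction algorithm: the rules here are hand-written rather than learned, and the feature names (`bytes`, `failed_logins`), the thresholds, the toy records, and the helper names `two_phase_predict` and `precision_recall` are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]


def two_phase_predict(record: Record, p_rules: List[Rule], n_rules: List[Rule]) -> bool:
    """Predict the target class if any first-phase rule covers the record
    and no second-phase rule flags it as a false positive."""
    covered = any(rule(record) for rule in p_rules)
    if not covered:
        return False
    return not any(rule(record) for rule in n_rules)


def precision_recall(predictions: List[bool], labels: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the target (True) class."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: two rare-class records and three normal ones.
records: List[Record] = [
    {"bytes": 900, "failed_logins": 5},
    {"bytes": 950, "failed_logins": 4},
    {"bytes": 990, "failed_logins": 0},
    {"bytes": 100, "failed_logins": 0},
    {"bytes": 120, "failed_logins": 1},
]
labels = [True, True, False, False, False]

# Phase 1 (hypothetical rule): broad, high-support rule aimed at recall.
p_rules: List[Rule] = [lambda r: r["bytes"] > 800]
# Phase 2 (hypothetical rule): removes false positives among covered records.
n_rules: List[Rule] = [lambda r: r["failed_logins"] == 0]

phase1_only = [two_phase_predict(r, p_rules, []) for r in records]
both_phases = [two_phase_predict(r, p_rules, n_rules) for r in records]

phase1_precision, phase1_recall = precision_recall(phase1_only, labels)
final_precision, final_recall = precision_recall(both_phases, labels)
```

In this toy setting the first-phase rule alone covers every target record but also one non-target record, so recall is complete while precision is not; the second-phase rule removes that covered non-target record without uncovering any target record.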

