Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, and deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it pursues high recall and high precision for the given target class as two separate objectives. The first phase aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve precision by learning rules that remove false positives from the collection of records covered by the first-phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. We claim that such an approach is inadequate for rare classes because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we construct various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5 rules, either fail to learn a model or learn a very poor model. In all these situations, our two-phase approach learns a model with significantly better recall and precision levels. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is either significantly better than or comparable to the best competitor in terms of achieving a better balance between recall and precision.
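The two-phase idea described above can be illustrated with a minimal Python sketch. It is not the induction algorithm itself: the rules here are written by hand rather than learned, and the feature names (`bytes`, `failed_logins`), the thresholds, and the toy records are all hypothetical. The sketch only shows how first-phase rules (which predict the presence of the target class) and second-phase rules (which remove false positives among the covered records) might combine at prediction time, and how each phase affects recall and precision.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]


def predict(record: Record, p_rules: List[Rule], n_rules: List[Rule]) -> bool:
    """Predict the target class if any first-phase rule covers the record
    and no second-phase rule rejects it."""
    if not any(rule(record) for rule in p_rules):
        return False
    return not any(rule(record) for rule in n_rules)


def recall_precision(
    records: List[Record],
    labels: List[bool],
    p_rules: List[Rule],
    n_rules: List[Rule],
) -> Tuple[float, float]:
    """Return (recall, precision) of the combined rule set for the target class."""
    preds = [predict(r, p_rules, n_rules) for r in records]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical toy data: True marks the rare target class.
records = [
    {"bytes": 900, "failed_logins": 5},
    {"bytes": 950, "failed_logins": 4},
    {"bytes": 800, "failed_logins": 0},
    {"bytes": 850, "failed_logins": 0},
    {"bytes": 100, "failed_logins": 0},
    {"bytes": 120, "failed_logins": 1},
]
labels = [True, True, False, False, False, False]

# Phase 1 (hand-written stand-in): a high-support rule that favors recall.
p_rules: List[Rule] = [lambda r: r["bytes"] > 500]
# Phase 2 (hand-written stand-in): a rule that removes false positives
# among the records covered by phase 1.
n_rules: List[Rule] = [lambda r: r["failed_logins"] == 0]

phase1_only = recall_precision(records, labels, p_rules, [])
both_phases = recall_precision(records, labels, p_rules, n_rules)
print("phase 1 only:", phase1_only)  # (1.0, 0.5)
print("both phases: ", both_phases)  # (1.0, 1.0)
```

On this toy data the first-phase rule alone covers every target record but also two non-target records, and the second-phase rule removes those two without losing any true positives.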