Mining needle in a haystack

Abstract

Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, and deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it tackles the objectives of high recall and high precision for the given target class separately. The first phase aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve precision by learning rules that remove false positives from the collection of records covered by the first-phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. In this paper, we claim that such an approach is inadequate for rare classes because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we design various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5 rules, either fail to learn a model or learn a very poor one. In all these situations, our two-phase approach learns a model with significantly better recall and precision. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is significantly better than, or comparable to, the best competitor in terms of achieving a better balance between recall and precision.
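The way the two phases combine at prediction time can be illustrated with a minimal Python sketch. It does not reproduce the authors' rule induction procedure: the rules, the feature names `a` and `b`, the toy records, and the helper names `two_phase_predict` and `recall_precision` are all hypothetical, chosen only to show how a high-recall first phase followed by a false-positive-removing second phase can raise precision without losing recall.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]


def two_phase_predict(record: Record, p_rules: List[Rule], n_rules: List[Rule]) -> bool:
    """Predict the target class if some phase-1 rule covers the record
    and no phase-2 rule flags it as a false positive."""
    covered = any(rule(record) for rule in p_rules)
    vetoed = any(rule(record) for rule in n_rules)
    return covered and not vetoed


def recall_precision(preds: List[bool], labels: List[bool]) -> Tuple[float, float]:
    """Return (recall, precision) for the target class."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical toy data: the target class is a > 0.5 and b <= 0.8.
records = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.7, "b": 0.3},
    {"a": 0.6, "b": 0.9},
    {"a": 0.8, "b": 0.95},
    {"a": 0.2, "b": 0.2},
    {"a": 0.1, "b": 0.9},
]
labels = [True, True, False, False, False, False]

# Phase 1 (hand-written here, not induced): broad, high-support rule aimed at recall.
p_rules: List[Rule] = [lambda r: r["a"] > 0.5]
# Phase 2 (hand-written here, not induced): rule describing the false positives.
n_rules: List[Rule] = [lambda r: r["b"] > 0.8]

phase1_preds = [two_phase_predict(r, p_rules, []) for r in records]
two_phase_preds = [two_phase_predict(r, p_rules, n_rules) for r in records]

print(recall_precision(phase1_preds, labels))     # (1.0, 0.5)
print(recall_precision(two_phase_preds, labels))  # (1.0, 1.0)
```

On this toy data the first-phase rule alone covers every target record but also two non-target records. Adding the second-phase rule removes those two false positives while keeping both true positives.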

Bibliographic details

Source
ACM SIGMOD International Conference on Management of Data, 2001, pp. 91-102 (12 pages)
Authors
Mahesh V. Joshi; Ramesh C. Agarwal; Vipin Kumar; Sharad Mehrotra
Original format: PDF
CLC classification: various specialized databases

