Published in: ACM SIGKDD international conference on Knowledge discovery in data mining

Estimating missed actual positives using independent classifiers

Abstract

Data mining is increasingly applied in environments with very high rates of data generation, such as network intrusion detection [7], where routers generate about 300,000 to 500,000 connections every minute. In such rare-class data domains, the cost of missing a rare-class instance is much higher than that of missing an instance of another class. However, the high cost of manually labeling instances, the high rate at which data is collected, and real-time response constraints do not always allow one to determine the actual classes of the collected unlabeled datasets. In our previous work [9], this problem of missed positives (false negatives) was explained in the context of two different domains: "network intrusion detection" and "business opportunity classification". In such cases, an estimate of the number of missed high-cost, rare instances aids in evaluating the performance of the modeling technique (e.g. classification) used. A capture-recapture method was used to estimate false negatives, using two or more learning methods (i.e. classifiers). This paper focuses on the dependence between the class labels assigned by such learners. We define conditional independence of classifiers given a class label and show its relation to the conditional independence of the feature sets (used by the classifiers) given a class label. Establishing the latter is computationally expensive; hence, a heuristic algorithm is proposed for obtaining conditionally independent (or less dependent) feature sets for the classifiers. Initial results of this algorithm on synthetic datasets are promising, and further research is being pursued.
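
The capture-recapture idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which estimator was used, so the code below assumes the standard two-sample Lincoln-Petersen estimator, which may differ from the estimator in [9]. The function name `estimate_missed_positives` and the instance IDs are hypothetical. The estimate is only meaningful if the two classifiers detect positives independently of each other, which is the assumption the paper examines.

```python
def estimate_missed_positives(found_by_a, found_by_b):
    """Lincoln-Petersen sketch (an assumed estimator, not taken from the paper).

    found_by_a, found_by_b: IDs of confirmed positive instances that
    classifier A and classifier B each detected.
    Returns (estimated total positives, estimated positives missed by both).
    """
    a = set(found_by_a)
    b = set(found_by_b)
    overlap = len(a & b)  # positives caught by both classifiers
    if overlap == 0:
        raise ValueError("no overlap between classifiers; estimate undefined")
    # Under independence: P(caught by both) = P(caught by A) * P(caught by B),
    # which gives N ~= |A| * |B| / |A and B|.
    total_est = len(a) * len(b) / overlap
    found = len(a | b)  # positives caught by at least one classifier
    return total_est, total_est - found


# Hypothetical IDs of confirmed positives flagged by each classifier.
a_hits = range(0, 60)   # classifier A caught instances 0..59
b_hits = range(40, 90)  # classifier B caught instances 40..89
total, missed = estimate_missed_positives(a_hits, b_hits)
print(total, missed)    # prints: 150.0 60.0
```

In this made-up example the classifiers jointly find 90 positives and share 20 of them, so the estimate is 150 positives in total and 60 missed by both. If the classifiers' errors are positively correlated, the overlap is inflated and the number of missed positives is underestimated, which is why less dependent feature sets matter.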
