Conference: IEEE International Conference on Data Engineering

Finding common ground among experts' opinions on data clustering: With applications in malware analysis



Abstract

Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require that all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major anti-virus (AV) products classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the number of data objects chosen to be clustered.
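To make the hypergraph idea in the abstract more concrete, the following is a minimal sketch, not the authors' algorithm. The abstract does not define what a conflict is, so the rule used here is an assumption made purely for illustration: three objects conflict when some expert groups x with y, some expert groups y with z, and some expert separates x from z. The function names and the greedy removal heuristic are also hypothetical; a greedy pass yields an independent set but not necessarily a maximum one.

```python
from collections import Counter
from itertools import combinations


def conflicting_triples(labelings):
    """Build hyperedges (3-object sets) under an ASSUMED conflict rule.

    labelings: list of dicts, one per expert, mapping object -> cluster label.
    Assumed rule (not taken from the paper): {x, y, z} conflicts when some
    expert puts x with y, some expert puts y with z, and some expert
    separates x from z.
    """
    objects = sorted(labelings[0])

    def together(u, v):
        return any(lab[u] == lab[v] for lab in labelings)

    def apart(u, v):
        return any(lab[u] != lab[v] for lab in labelings)

    edges = set()
    for a, b, c in combinations(objects, 3):
        # Try each of the three objects as the "middle" object y.
        for x, y, z in ((b, a, c), (a, b, c), (a, c, b)):
            if together(x, y) and together(y, z) and apart(x, z):
                edges.add(frozenset((a, b, c)))
                break
    return edges


def greedy_independent_set(objects, edges):
    """Heuristic: repeatedly drop the object that appears in the most
    remaining hyperedges until no hyperedge is left."""
    remaining = set(objects)
    live = set(edges)
    while live:
        counts = Counter(v for e in live for v in e)
        # sorted() makes tie-breaking deterministic.
        worst = max(sorted(counts), key=counts.__getitem__)
        remaining.discard(worst)
        live = {e for e in live if worst not in e}
    return remaining


# Hypothetical toy input: two experts labeling four samples.
experts = [
    {"a": 1, "b": 1, "c": 2, "d": 3},
    {"a": 1, "b": 2, "c": 2, "d": 3},
]
edges = conflicting_triples(experts)
kept = greedy_independent_set(experts[0].keys(), edges)
```

In this toy input the only hyperedge under the assumed rule is {a, b, c}, so dropping any one of those three objects leaves a conflict-free selection. The trade-off the abstract describes shows up directly: every removed object raises confidence in the remaining grouping while shrinking the set of objects that get clustered.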
