Venue: IEEE International Conference on Data Engineering

Finding common ground among experts' opinions on data clustering: With applications in malware analysis

Abstract

Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows, data clustering becomes computationally prohibitive and resource-demanding, and it is sometimes necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased by the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering in that we do not require all data objects to be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects, if they are selected, are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major anti-virus (AV) products classify these samples. Our work offers a new direction for consensus clustering by striking a balance between clustering quality and the number of data objects chosen to be clustered.
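The abstract does not define what counts as a conflict or how the independent set is computed, so the following is only a minimal sketch under stated assumptions. It assumes, hypothetically, that a triple of objects conflicts when the relation "at least one expert puts these two objects in the same cluster" holds for exactly two of its three pairs, that is, when the relation fails to be transitive on that triple. It then uses a simple greedy heuristic, not the authors' technique, to drop objects until no conflicting triple is fully selected. The function names `conflicting_triples` and `greedy_independent_set` are illustrative.

```python
from collections import Counter
from itertools import combinations


def conflicting_triples(labelings):
    """Build hyperedges of a 3-uniform hypergraph from expert labelings.

    labelings: list of dicts mapping object -> cluster label, one per expert.
    ASSUMPTION (not from the paper): a triple conflicts when the
    "some expert co-clusters this pair" relation holds for exactly
    two of its three pairs.
    """
    objects = sorted(labelings[0])

    def together(x, y):
        return any(lab[x] == lab[y] for lab in labelings)

    triples = []
    for a, b, c in combinations(objects, 3):
        links = together(a, b) + together(b, c) + together(a, c)
        if links == 2:
            triples.append(frozenset((a, b, c)))
    return triples


def greedy_independent_set(objects, hyperedges):
    """Greedy heuristic: repeatedly drop the object that appears in the
    most remaining hyperedges, until no hyperedge is left.

    The result contains no hyperedge entirely, so it is an independent
    set, but it is not guaranteed to be a maximum one.
    """
    kept = set(objects)
    live = list(hyperedges)
    while live:
        counts = Counter(v for edge in live for v in edge)
        # Sort keys first so ties are broken deterministically.
        worst = max(sorted(counts), key=lambda v: counts[v])
        kept.discard(worst)
        live = [edge for edge in live if worst not in edge]
    return kept


# Two hypothetical experts labelling four samples.
expert_1 = {"a": 0, "b": 0, "c": 1, "d": 2}
expert_2 = {"a": 0, "b": 1, "c": 1, "d": 2}
edges = conflicting_triples([expert_1, expert_2])
selected = greedy_independent_set(expert_1.keys(), edges)
```

Enumerating all triples costs cubic time in the number of objects, so this sketch is meant for small illustrative inputs only and would not scale to a dataset with hundreds of thousands of samples without a smarter way of generating candidate triples.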
