Conference: Annual Midwest Instruction and Computing Symposium

Exploring Alternative Clustering for PIY Source Code Detection

Abstract

In this paper, we compare clustering algorithms on a specific type of data set. Clustering is a powerful tool because it partitions data into meaningful groups based on the information in the data set. Given a set of data points, each defined by a set of attributes, we find clusters such that points within a cluster are more similar to one another than to points in other clusters. These clusters are crucial to how data is analyzed: they help us identify and give meaning to data according to its traits. Because clustering makes a data set more useful, the study of techniques for finding the most representative cluster model is vital in knowledge extraction. In previous work, Anthony Ohmann and Professor Imad Rahal proposed a scalable system called PIY (Program It Yourself) that detects source code plagiarism over a large repository of submissions, where each new submission is compared to existing ones. With clusters, a new submission can be compared to a subset of the data. Accuracy and time are both important factors for PIY; therefore, we evaluate clustering efficiency in terms of accuracy and time. We analyze K-Harmonic Means (KHM) against one of PIY's current clustering algorithms, K-Medoid. Developed by Dr. Bin Zhang, the KHM algorithm is derived from K-Means and the harmonic average. It is known to be more "robust" than the K-Means algorithm. Our goal is to determine which algorithm gives the most favorable results.
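As a rough illustration of how KHM relates to K-Means, the sketch below contrasts the two objective functions as they are commonly stated: K-Means sums each point's squared distance to its nearest center, while KHM sums the harmonic average of each point's distances to all centers. This is a minimal sketch and not the PIY implementation. The function names, the exponent `p`, and the `eps` clamp are assumptions made here for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_objective(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(pt, c) for c in centers) for pt in points)


def khm_objective(points, centers, p=2.0, eps=1e-12):
    """Sum over points of the harmonic average of distance**p to every center.

    `p` and `eps` are illustrative assumptions; `eps` only guards against
    division by zero when a point coincides with a center.
    """
    k = len(centers)
    total = 0.0
    for pt in points:
        # distance**p computed from the squared distance as (d^2)^(p/2)
        inv_sum = sum(
            1.0 / max(sq_dist(pt, c) ** (p / 2.0), eps) for c in centers
        )
        total += k / inv_sum
    return total


if __name__ == "__main__":
    pts = [(0.5, 0.0), (1.8, 0.1)]
    ctrs = [(0.0, 0.0), (2.0, 0.0)]
    print("K-Means objective:", kmeans_objective(pts, ctrs))
    print("KHM objective:", khm_objective(pts, ctrs))
```

Because the harmonic average involves every center rather than only the nearest one, each point contributes to the position of all centers. That is one commonly cited reason KHM is described as less sensitive to initialization than K-Means.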
