A comparative study of clustering and classification algorithms.

Abstract

Clustering and classification are two of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. Clustering is the process of organizing unlabeled objects into groups whose members are similar in some way. It is a form of unsupervised learning: it does not use category labels when grouping objects. In semi-supervised clustering, some prior knowledge is available, either as labeled data or as pairwise constraints on some of the objects. Classification is a form of supervised learning: a procedure for assigning class labels. A classifier is constructed from labeled training data using a classification algorithm and is then used to predict the class labels of the test data.

This dissertation presents the results of a comprehensive comparative study of three kinds of clustering algorithms: Co-Clustering, Consensus-based Clustering, and Semi-supervised Clustering. The performance of the three kinds of clustering algorithms was compared and analyzed through experiments on artificial datasets with different data substructures and on UCI datasets. A method was proposed to combine a Co-Clustering algorithm with a Semi-supervised Clustering algorithm. A comprehensive comparative study was also conducted on three kinds of classification algorithms: Logistic Regression Classifier, Support Vector Machine, and Decision Tree. Experiments were carried out on different artificial datasets and UCI datasets to analyze and compare their classification performance. A method using a controlled False Discovery Rate was proposed to select important features in the Logistic Regression Classifier. A detailed proof was developed to show that the False Discovery Rate can be controlled by controlling the related p-value. Experiments were also conducted to compare classification performance when using the proposed feature selection algorithm.

Keywords: Classification, Clustering, Semi-supervised Clustering, Feature Selection.
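
The abstract states that the False Discovery Rate can be controlled through the related p-value, but it does not give the procedure. As an illustration only, the sketch below uses the standard Benjamini-Hochberg step-up procedure to pick features from per-feature p-values. This is an assumption and not necessarily the dissertation's method; the function name `benjamini_hochberg`, the level `q`, and the example p-values are all hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of features kept at FDR level q.

    Standard Benjamini-Hochberg step-up rule (an assumed stand-in for
    the dissertation's procedure): find the largest rank k such that
    the k-th smallest p-value is <= (k / m) * q, then keep the k
    features with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Feature indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank  # remember the largest rank passing its threshold
    return sorted(order[:k_max])


# Hypothetical p-values, e.g. one per logistic regression coefficient.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_values, q=0.05)
print(selected)  # indices of the selected features
```

In this sketch the selection reduces to a data-dependent cutoff on the p-values, which is the general sense in which bounding the p-value bounds the False Discovery Rate.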

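Bibliographic record

  • Author

    Huang, Shuqing

  • Author affiliation

    Tulane University School of Science and Engineering

  • Degree-granting institution: Tulane University School of Science and Engineering
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2007
  • Pages: 117 p.
  • Total pages: 117
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology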