Clustering algorithms, classification algorithms and their applications in medical databases.

Abstract

Data mining is the process of discovering hidden patterns and relationships in large databases using techniques such as clustering and classification. Clustering is the process of discovering groups of data such that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Many partitional clustering algorithms, such as PAM, CLARA and CLARANS, fail to identify natural clusters of arbitrary shapes and sizes. They also require the number of clusters to be provided in advance, which is very difficult to determine for large, high-dimensional data sets. Hierarchical clustering algorithms such as CURE and ROCK were developed to overcome the limitations of partitional algorithms. However, these algorithms are based on a static model and hence also fail to discover natural clusters. CHAMELEON is a hierarchical clustering algorithm that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the relative closeness and the relative inter-connectivity between them are both greater than their threshold values. This property ensures that natural clusters of arbitrary shapes, sizes and densities are identified.

I have implemented CHAMELEON in Maple 8.0 using the C library available in the METIS and hMETIS graph partitioning packages and the mining software developed by Dr. Quoc-Nam Tran, Associate Professor, Lamar University. I have achieved a considerable improvement of 35% in the execution time of the program on the benchmark data sets.

Classification, also called supervised clustering, is another data mining technique. In this technique, a model is constructed from a training data set, then tested and used to classify records whose class labels are unknown. I have implemented classification using Gini-based and entropy-based approaches and applied the program to thrombosis data sets. I have also compared the results of the two approaches and identified interesting rules for classifying thrombosis into one of four possible categories, namely type 0, type 1, type 2 and type 3.
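
The Gini-based and entropy-based approaches mentioned in the abstract differ in the impurity measure used to score a candidate split. The following Python sketch illustrates the two standard measures and the impurity reduction of a binary split. It is not the thesis implementation, which was written in Maple; the function names and the toy labels (standing in for thrombosis types 0 to 3) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((count / n) * math.log2(n / count)
               for count in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy labels, not data from the thesis.
parent = [0, 0, 1, 1, 2, 3, 3, 3]
left, right = [0, 0, 1, 1], [2, 3, 3, 3]

print("Gini gain:   ", split_gain(parent, left, right, gini))
print("Entropy gain:", split_gain(parent, left, right, entropy))
```

A decision-tree builder would evaluate every candidate split this way and keep the one with the largest gain. The two measures usually rank splits similarly but not always identically, which is why comparing the resulting rules can be informative.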

Bibliographic details

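  • Author

    Baddam, Sudheer R.

  • Author affiliation

    Lamar University - Beaumont

  • Degree-granting institution Lamar University - Beaumont
  • Subject Computer Science
  • Degree M.S.
  • Year 2005
  • Pagination 62 p.
  • Total pages 62
  • Original format PDF
  • Language of text English
  • Chinese Library Classification Automation technology, computer technology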
