
A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure.


Abstract

Clustering seeks to partition a given set of points into a number of clusters such that points in the same cluster are similar to one another and dissimilar to points in other clusters. By virtue of this goal, relational data are a natural fit for clustering: the similarity and dissimilarity relations between the data points are the nuts and bolts of cluster formation, so the task is driven by the notion of similarity between the data points. In practice, similarity is usually measured by the pairwise distances between the data points. Indeed, the objective functions of two widely used clustering algorithms, k-means and fuzzy c-means, are expressed in terms of these pairwise distances.

The clustering task is complicated by the choice of distance measure and by the need to estimate the number of clusters. Fuzzy c-means is convenient when there is uncertainty in allocating points in overlapping areas to clusters. The k-means algorithm allocates each point unequivocally to one cluster, overlooking the similarities between points in overlapping areas. The fuzzy approach allows a point to be a member of as many clusters as necessary; thus it provides better insight into the relations between the points in overlapping areas.

In this thesis we develop a relational framework inspired by the silhouette measure of clustering quality. The framework asserts the relations between the data points by means of logical reasoning with the cluster membership values. The original description of computing silhouettes is limited to crisp partitions. A natural generalization of silhouettes to fuzzy partitions is given within our framework. Moreover, two notions of silhouette emerge within the framework at different levels of granularity: the point-wise silhouette and the center-wise silhouette. With this generalization, each silhouette can measure the extent to which a crisp or fuzzy partition has fulfilled the clustering goal at the level of the individual points or of the cluster centers. The partitions are evaluated by the silhouette measure in conjunction with point-to-point or center-to-point distances.

With the generalization, the average silhouette value becomes a reasonable device for selecting between crisp and fuzzy partitions of the same data set. Accordingly, one can find out which partition better represents the relations between the data points, in accordance with their pairwise distances. This powerful feature of the generalized silhouettes has exposed a problem with the partitions generated by fuzzy c-means. We have observed that defuzzifying the fuzzy c-means partitions always improves the overall representation of the relations between the data points. This is due to an inconsistency between some of the membership values and the distances between the data points. This inconsistency has been reported by others on a couple of occasions in real-life applications.

Finally, we present an experiment that demonstrates a successful application of the generalized silhouette measure to feature selection for highly imbalanced classification. A significant improvement in classification on a real data set resulted from a significant reduction in the number of features.
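The abstract does not state the generalized (fuzzy) silhouette formula, so the following is only a minimal sketch of the two ingredients it does name: the original crisp, point-wise silhouette computed from pairwise distances, and defuzzification of a fuzzy membership matrix. The function names, the toy data, and the maximum-membership defuzzification rule are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def crisp_silhouette(D, labels):
    """Original (crisp) point-wise silhouette s = (b - a) / max(a, b).

    a: mean distance from a point to the other points of its own cluster.
    b: smallest mean distance from the point to the points of another cluster.
    Assumes at least two clusters are present in `labels`.
    """
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself from its own cluster
        if not own.any():
            continue  # singleton cluster: silhouette left at 0 by convention
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


def defuzzify(U):
    """Assumed rule: assign each point to its maximum-membership cluster."""
    return U.argmax(axis=1)


# Hypothetical toy data: two well-separated groups of two points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
# Hypothetical fuzzy memberships (rows: points, columns: clusters).
U = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])

D = pairwise_distances(X)
labels = defuzzify(U)
s = crisp_silhouette(D, labels)
print("labels:", labels)
print("average silhouette:", s.mean())
```

An average silhouette near 1 indicates that the crisp partition agrees well with the pairwise distances. Comparing this value against a silhouette computed directly on the fuzzy partition would require the thesis's generalized definition, which is not reproduced here.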

Bibliographic record

  • Author: Rawashdeh, Mohammad Y.
  • Author affiliation: University of Cincinnati
  • Degree-granting institution: University of Cincinnati
  • Subjects: Computer science; Information science
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 122 p.
  • Total pages: 122
  • Original format: PDF
  • Language of text: English
