Analyse de grappe des données de catégories et de séquences étude et application à la prédiction de la faillite personnelle

English translation: Cluster analysis of categorical and sequence data: study and application to personal bankruptcy prediction

Abstract

Cluster analysis is one of the most important and useful data mining techniques, with many applications in pattern extraction, information retrieval, summarization, compression and other areas. The focus of this thesis is on clustering categorical and sequence data. Clustering categorical and sequence data is much more challenging than clustering numeric data because there is no inherently meaningful measure of similarity between categorical objects or between sequences. In this thesis, we design novel, efficient and effective algorithms for clustering categorical data and sequences, respectively, and we perform extensive experiments to demonstrate the superior performance of the proposed algorithms. We also explore the extent to which the proposed clustering algorithms can help to solve the personal bankruptcy prediction problem.

Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively dealing with clusters that are often embedded in different subspaces. In this thesis, we view the task of clustering categorical data from an optimization perspective and propose a novel objective function. Based on the new formulation, we design a divisive hierarchical clustering algorithm for categorical data, named DHCC. In the bisection procedure of DHCC, the initialization of the splitting is based on multiple correspondence analysis (MCA). We devise a strategy for dealing with the key issue in the divisive approach, namely, when to terminate the splitting process. The proposed algorithm is parameter-free, independent of the order in which the data is processed, scalable to large data sets and capable of seamlessly discovering clusters embedded in subspaces.

Prior knowledge about the data can be incorporated into the clustering process, an approach known as semi-supervised clustering, to improve learning accuracy considerably.
In this thesis, we view semi-supervised clustering of categorical data as an optimization problem with extra instance-level constraints, and propose a systematic and fully automated approach that guides the optimization process to a solution that better satisfies the constraints, which also benefits the unconstrained objects. The proposed semi-supervised divisive hierarchical clustering algorithm for categorical data, named SDHCC, is parameter-free, fully automatic and effective in using background knowledge expressed as instance-level constraints to improve the quality of the resulting dendrogram.

Many existing sequence clustering algorithms rely on a pairwise measure of similarity between sequences. Usually, such a measure is effective if there are significantly informative patterns in the sequences. However, it is difficult to define a meaningful pairwise similarity measure if the sequences are short and contain noise. In this thesis, we circumvent the obstacle of defining a pairwise similarity by defining the similarity between an individual sequence and a set of sequences. Building on this new similarity measure, which is derived from the conditional probability distribution (CPD) model, we design a novel model-based K-means clustering algorithm for sequence clustering, which works in a similar way to the traditional K-means on vectorial data.

Finally, we develop a personal bankruptcy prediction system whose predictors are mainly the bankruptcy features discovered by the clustering techniques proposed in this thesis. The mined bankruptcy features are represented in a low-dimensional vector space. On this new feature space, which can be extended with existing predictive features (e.g., credit score), a support vector machine (SVM) classifier is built to combine the mined and the existing features. Our system is readily comprehensible and demonstrates promising prediction performance.
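The idea of scoring a sequence against a set of sequences, rather than against another single sequence, can be illustrated with a small sketch. The code below is not the algorithm from the thesis: it assumes a first-order Markov chain with add-alpha smoothing as the conditional probability model, uses the length-normalized log-likelihood as the sequence-to-cluster similarity, and all names (`fit_cpd`, `avg_log_likelihood`, `cpd_kmeans`, the `^` start marker, the `init_labels` parameter) are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict

START = "^"  # hypothetical start-of-sequence marker (assumption, not from the thesis)


def fit_cpd(seqs, alphabet, alpha=1.0):
    """Estimate P(next symbol | previous symbol) from a set of sequences.

    Assumption: a first-order Markov model with add-alpha smoothing stands in
    for the thesis's CPD model, whose exact form the abstract does not give.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        prev = START
        for sym in seq:
            counts[prev][sym] += 1.0
            prev = sym
    model = {}
    for prev in [START] + list(alphabet):
        total = sum(counts[prev].values()) + alpha * len(alphabet)
        model[prev] = {s: (counts[prev][s] + alpha) / total for s in alphabet}
    return model


def avg_log_likelihood(seq, model):
    """Similarity between one sequence and a cluster, via the cluster's model.

    Every symbol of `seq` must belong to the alphabet the model was fit on.
    """
    prev, total = START, 0.0
    for sym in seq:
        total += math.log(model[prev][sym])
        prev = sym
    return total / max(len(seq), 1)


def cpd_kmeans(seqs, k, n_iter=20, seed=0, init_labels=None):
    """K-means-style loop: refit one model per cluster, then reassign sequences."""
    alphabet = sorted({sym for seq in seqs for sym in seq})
    if init_labels is None:
        rng = random.Random(seed)
        labels = [rng.randrange(k) for _ in seqs]
    else:
        labels = list(init_labels)
    for _ in range(n_iter):
        # "Centroid" step: each cluster is summarized by a fitted model.
        models = [
            fit_cpd([s for s, lab in zip(seqs, labels) if lab == c], alphabet)
            for c in range(k)
        ]
        # Assignment step: move each sequence to its best-scoring cluster.
        new_labels = [
            max(range(k), key=lambda c: avg_log_likelihood(s, models[c]))
            for s in seqs
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

As with ordinary K-means, the result depends on the initial assignment, and a cluster that becomes empty falls back to a uniform model in this sketch. Calling `cpd_kmeans(seqs, k)` on a list of strings returns one cluster label per sequence.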

Bibliographic record

  • Author

    Xiong Tengke

  • Author affiliation
  • Year 2011
  • Total pages
  • Original format PDF
  • Language of text eng
  • CLC classification
