Efficient density-based methods for knowledge discovery in databases

Abstract

Today's data storage facilities allow recording of billions of transactions from business applications, scientific sensor readings, monitoring systems, etc. Scientists developing new drugs, system administrators monitoring complex technical processes, and decision makers responsible for complex social or technical systems require an overview and even a deeper understanding of their respective data. The knowledge discovery in databases (KDD) process has been designed to identify hidden patterns in large data resources. A central step of the KDD process is the data mining task. Major data mining tasks are clustering and classification. Density-based approaches have proven to be very effective for many data mining methods. However, this effectiveness often comes at the cost of high runtime complexity. This thesis presents new efficient density-based approaches for different data mining applications while always keeping the effectiveness of the newly developed methods in mind.

The first part of this thesis is concerned with new density-based clustering methods. Clustering is a data mining task for summarizing data such that similar objects are grouped together while dissimilar ones are separated. Density-based approaches have been shown to successfully mine arbitrarily shaped clusters even in the presence of noise. In high-dimensional data, clusters are typically hidden by irrelevant attributes and do not show across the full space. As the relevance of attributes is not globally uniform for all clusters, global dimensionality reduction approaches are not adequate. Subspace clustering aims at automatically detecting clusters and their relevant attribute projections. This work presents a new clustering model, DUSC, which guarantees a comparable and redundancy-free subspace clustering result. As the number of possible subspaces is exponential in the number of dimensions, subspace clustering is a computationally challenging task. The algorithm eDUSC developed in this work is based on a filter-and-refinement architecture which avoids repeated database scans. Furthermore, this work proposes a new visualization technique for subspace clusters and a specialized clustering technique for multi-dimensional sequence databases.

The second part of this thesis proposes new density-based methods for classification. Classification aims at assigning a class label to unknown objects. Various approaches for classifying objects have been investigated in recent decades. Classifiers based on statistical approaches have been most intensively studied in the literature, and results such as asymptotic behavior and classification bias have been derived. To apply statistical classifiers, the density of objects has to be estimated. In this work, a hierarchy of density estimators is proposed which makes the classification of objects possible at any time. Additionally, a new classification method using subspace clusters for higher dimensionalities is developed in this thesis. The proposed density-based clustering and classification methods are evaluated in terms of both efficiency and effectiveness in thorough experiments on real-world and synthetic data.
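The abstract notes that statistical classifiers need an estimate of the object density. As a rough illustration of that general idea only, the sketch below builds a one-dimensional Bayes classifier from Gaussian kernel density estimates. It is not the hierarchy of density estimators proposed in the thesis; the function names, the toy data, and the bandwidth value are all assumptions made for this example.

```python
import math


def kernel_density(samples, bandwidth):
    """Return a function estimating the density of 1-D samples at a point x.

    Uses a Gaussian kernel; the bandwidth is a hand-picked assumption.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def train(labelled, bandwidth=0.5):
    """Group (value, label) pairs by label; store each class prior and density."""
    by_class = {}
    for x, label in labelled:
        by_class.setdefault(label, []).append(x)
    total = len(labelled)
    return {
        label: (len(xs) / total, kernel_density(xs, bandwidth))
        for label, xs in by_class.items()
    }


def classify(model, x):
    """Pick the label maximizing prior * estimated class-conditional density."""
    return max(model, key=lambda label: model[label][0] * model[label][1](x))


# Toy data (hypothetical): class "a" lies near 0, class "b" near 3.
training = [(-0.4, "a"), (-0.1, "a"), (0.0, "a"), (0.2, "a"), (0.3, "a"),
            (2.6, "b"), (2.9, "b"), (3.0, "b"), (3.3, "b")]
model = train(training)

print(classify(model, 0.1))
print(classify(model, 2.8))
```

Each query evaluates every training sample, so the cost grows linearly with the data size. That runtime cost is the kind of efficiency concern the abstract raises for density-based methods.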

Bibliographic record

  • Author

    Krieger Ralph;

  • Author affiliation
  • Year 2008
  • Total pages
  • Original format PDF
  • Text language eng
  • Chinese Library Classification
