A Scalable Physics-based Data Modeling Framework to Unsupervised High-Dimensional Data Mining.


Abstract

Today's modeling and analysis of high-dimensional data either relies on human expertise to hand-craft a set of task-specific data, which suffers significantly from the ever-increasing complexity and unknown patterns of new data, or relies on simple data-driven approaches, which tend to lose the fundamental physical insights of real-world datasets. It is therefore very difficult with today's modeling practice to detect reliable patterns and information in high-dimensional data efficiently, effectively, and without supervision. In this dissertation, we developed a scalable data modeling framework that uses modern theoretical physics for unsupervised high-dimensional data analysis and mining. It has a solid theoretical background and handles a range of tasks (clustering, anomaly detection, feature selection, etc.). The framework also has a probabilistic interpretation that avoids sensitivity to scaling-parameter tuning and to noise in real-world applications. Furthermore, we presented a fast approximation approach that makes the framework applicable to large-scale datasets with high efficiency and effectiveness.

In this dissertation research, we made the following salient contributions.

We proposed a diffusion-based Aggregated Heat Kernel (AHK) to improve clustering stability, and a Local Density Affinity Transformation (LDAT) to correct the bias originating from different cluster densities. Our proposed framework integrates these two techniques systematically. As a result, it not only provides a noise-resistant and density-aware spectral mapping of the original datasets, but also keeps the clustering stable while the scaling parameters are tuned.

We devised a Local Anomaly Descriptor (LAD) that faithfully reveals the intrinsic neighborhood density in order to detect anomalies. LAD bridges global and local properties, which makes it self-adaptive to the neighborhoods of different samples. To make the local density measurement more stable under scaling-parameter tuning, we formulated a Fermi Density Descriptor (FDD). FDD steadily distinguishes anomalies from normal instances under most scaling-parameter settings. We also quantified and examined the effect of different Laplacian normalizations on anomaly detection.

We developed a robust feature selection algorithm, called Noise-Resistant Unsupervised Feature Selection (NRFS). It measures a multi-perspective correlation that reflects the importance of features with respect to noise-resistant instance representatives and to different global trends from spectral decomposition. In this way, the model concisely captures a wide variety of local patterns and selects high-quality representative features.

We reduced the space and time complexity of spectral embedding, in order to apply the above techniques to real-world large-scale data mining, by proposing a Diverse Power Iteration Embedding (DPIE). We tested DPIE on various applications (e.g., clustering, anomaly detection, and feature selection). The experimental results showed that DPIE is more effective than popular spectral approximation methods, and even attains quality similar to that of the classic spectral embedding derived from eigendecomposition. Moreover, DPIE is extremely fast on big-data applications.

Finally, we provided a brief introduction to our ongoing work and future research directions. By elaborating on the work developed within the proposed framework, we showed that our scalable physics-based unsupervised data modeling is potent and promising for large-scale and high-dimensional data analysis, data mining, and knowledge discovery. It is a rich and fruitful area for research in terms of both theory and applications.
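The abstract names a diffusion-based heat kernel but gives no formulas, so the following is only a minimal sketch of the general idea, not the dissertation's AHK. It computes the standard graph heat kernel exp(-tL) for the unnormalized Laplacian L = D - W and, as a loudly labeled assumption, "aggregates" by plainly averaging the kernel over a few diffusion times. The function names, the Gaussian affinity, the toy data, and the averaging rule are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_affinity(X, sigma):
    """Dense Gaussian affinity matrix with a zero diagonal (illustrative choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W


def heat_kernel(W, t):
    """Heat kernel exp(-t L) of the unnormalized graph Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)  # L is symmetric, so eigh applies
    # sum_j exp(-t * lambda_j) * v_j v_j^T
    return (evecs * np.exp(-t * evals)) @ evecs.T


def averaged_heat_kernel(W, times):
    """ASSUMPTION: aggregate by a plain average over diffusion times.
    The dissertation's actual aggregation rule is not given in the abstract."""
    return sum(heat_kernel(W, t) for t in times) / len(times)


# Toy data: two well-separated Gaussian blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])

W = gaussian_affinity(X, sigma=1.0)
H = averaged_heat_kernel(W, times=[0.1, 0.5, 1.0])

within = H[:20, :20].mean()    # average diffusion affinity inside the first blob
between = H[:20, 20:].mean()   # average diffusion affinity across the two blobs
print(within, between)
```

In this sketch the averaged kernel stays symmetric and each row sums to one, because the constant vector lies in the null space of L. Averaging over several diffusion times is one simple way to reduce dependence on a single time scale; whether it matches the behavior the abstract reports for AHK is not something this example can establish.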

Bibliographic record

  • Author

    Huang, Hao

  • Author affiliation

    State University of New York at Stony Brook

  • Degree-granting institution: State University of New York at Stony Brook
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 221 p.
  • Total pages: 221
  • Original format: PDF
  • Language of text: English

  • Date indexed: 2022-08-17 11:54:03
