Methods for Statistical Association Mining by Variable-to-Set Affinity Testing


Abstract

Statistical data mining refers to methods for identifying and validating interesting patterns from an overabundance of data. Data mining tasks in which the objective involves pairwise relationships between variables are known as association mining. In general, features sought by association mining methods are sets of variables, often small subsets of a larger collection, that are more associated internally than externally. Methods vary in both the measure of association that is studied and the algorithm by which associated sets are identified. This dissertation provides a generalized framework for association mining called Variable-to-Set Affinity Testing (VSAT). Unlike conventional techniques for clustering or community detection, which usually maximize a score from a dissimilarity or adjacency matrix, the VSAT approach is an adaptive procedure grounded in statistical hypothesis testing principles. The framework is adaptable to a broad class of measurements for variable relationships, and is equipped with theoretical guarantees of error control.

This dissertation also presents in detail two new association mining methods built in the VSAT framework. The first, Differential Correlation Mining (DCM), identifies variable sets that have higher average pairwise correlation in one sample condition than in another. Such artifacts are of scientific interest in many fields, including statistical genetics and neuroscience. Differential Correlation Mining is applied to high-dimensional data sets in these two fields. The second method, Coherent Set Mining (CSM), is a novel approach to association mining in binary data. Dichotomous observations are assumed to derive from a latent variable of interest via thresholding. The Coherent Set Mining method identifies variable sets that are strongly associated in the latent measure, despite distortions in the association structure of the observed data due to the thresholding process. Coherent Set Mining is applied to problems in text mining, statistical genetics, and product recommendation.
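
As a rough illustration of the quantity that Differential Correlation Mining targets, the sketch below computes the average pairwise correlation of a candidate variable set under two sample conditions and takes the difference. This is only a descriptive score on simulated data, not the dissertation's algorithm: it omits the adaptive search and the hypothesis testing that give VSAT its error control. The function names `mean_pairwise_correlation` and `differential_correlation_score`, and the simulated data, are assumptions made for this example.

```python
import numpy as np


def mean_pairwise_correlation(data, variables):
    """Average off-diagonal Pearson correlation among the chosen columns."""
    corr = np.corrcoef(data[:, variables], rowvar=False)
    k = len(variables)
    # Subtract the diagonal (all ones) and average over the k*(k-1) ordered pairs.
    return (corr.sum() - np.trace(corr)) / (k * (k - 1))


def differential_correlation_score(data_a, data_b, variables):
    """Mean pairwise correlation in condition A minus that in condition B."""
    return (mean_pairwise_correlation(data_a, variables)
            - mean_pairwise_correlation(data_b, variables))


# Simulated data (an assumption for illustration): variables 0-2 share a
# common signal in condition A only; all other variables are independent noise.
rng = np.random.default_rng(0)
n, p = 500, 6
shared = rng.normal(size=(n, 1))
cond_a = rng.normal(size=(n, p))
cond_a[:, :3] += 2.0 * shared
cond_b = rng.normal(size=(n, p))

score_linked = differential_correlation_score(cond_a, cond_b, [0, 1, 2])
score_unlinked = differential_correlation_score(cond_a, cond_b, [3, 4, 5])
print(f"set {{0,1,2}}: {score_linked:.2f}")
print(f"set {{3,4,5}}: {score_unlinked:.2f}")
```

In this simulation the score for the first set should be clearly positive and the score for the second set should be close to zero. A score alone does not say whether a difference is statistically meaningful, which is the gap a testing-based framework such as VSAT is meant to address.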

Bibliographic record

  • Author

    Bodwin, Kelly Nicole

  • Author affiliation

    The University of North Carolina at Chapel Hill

  • Degree-granting institution The University of North Carolina at Chapel Hill
  • Discipline Statistics
  • Degree Ph.D.
  • Year 2017
  • Pages 126 p.
  • Total pages 126
  • Original format PDF
  • Language of text English
