Mining Non-Redundant High Order Correlations in Binary Data

Abstract

Many approaches have been proposed to find correlations in binary data. Usually, these methods focus on pairwise correlations. In biological applications, it is important to find correlations that involve more than two features. Moreover, a set of strongly correlated features should be non-redundant, in the sense that the correlation is strong only when all the interacting features are considered together: removing any feature greatly reduces the correlation.

In this paper, we explore the problem of finding non-redundant high order correlations in binary data. The high order correlations are formalized using multi-information, a generalization of pairwise mutual information. To reduce redundancy, we require every subset of a strongly correlated feature subset to be weakly correlated. Such feature subsets are referred to as Non-redundant Interacting Feature Subsets (NIFS). Finding all NIFSs is computationally challenging because, in addition to enumerating feature combinations, we also need to check all their subsets for redundancy. We study several properties of NIFSs and show that these properties are useful in developing efficient algorithms. We further develop two sets of upper and lower bounds on the correlations, which can be incorporated into the algorithm to prune the search space. A simple and effective pruning strategy based on pairwise mutual information is also developed to prune the search space further. The efficiency and effectiveness of our approach are demonstrated through extensive experiments on synthetic and real-life datasets.
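
To make the NIFS definition concrete, the following is a minimal brute-force sketch, not the paper's algorithm. It assumes the standard definition of multi-information, the sum of the single-feature entropies minus the joint entropy, estimated from empirical frequencies. The function names and the two thresholds `strong` and `weak` are hypothetical, since the abstract does not state how the cut-offs are named or chosen. No bounds or pruning are applied here; the sketch simply checks every proper subset of size two or more.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Empirical joint entropy, in bits, of the features indexed by `cols`."""
    n = len(rows)
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def multi_information(rows, cols):
    """Sum of single-feature entropies minus the joint entropy of `cols`."""
    return sum(entropy(rows, (c,)) for c in cols) - entropy(rows, cols)


def is_nifs(rows, cols, strong, weak):
    """Brute-force NIFS test with hypothetical thresholds `strong` and `weak`.

    `cols` must be strongly correlated (multi-information >= strong) and
    every proper subset with at least two features must be weakly
    correlated (multi-information <= weak).
    """
    if multi_information(rows, cols) < strong:
        return False
    for size in range(2, len(cols)):
        for subset in combinations(cols, size):
            if multi_information(rows, subset) > weak:
                return False
    return True


# Toy data (an assumption for illustration): feature 2 is the XOR of
# features 0 and 1, so the three features interact only jointly.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

# Toy data where features 0 and 1 are copies and feature 2 is independent:
# the triple is strongly correlated, but the pair (0, 1) already explains it.
copy_rows = [(a, a, b) for a in (0, 1) for b in (0, 1)]

xor_is_nifs = is_nifs(xor_rows, (0, 1, 2), strong=0.9, weak=0.1)
copy_is_nifs = is_nifs(copy_rows, (0, 1, 2), strong=0.9, weak=0.1)
```

On the XOR data every pair of features is independent while the triple carries one bit of multi-information, so the triple passes the test. On the copy data the triple also carries one bit, but the pair (0, 1) carries that same bit on its own, so the triple is rejected as redundant. The nested loop over subsets is exponential in the subset size, which is the cost the bounds and pruning strategies described above are meant to avoid.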