Robust Significant Feature Detection by Learning Discriminant Boundary in Multi-dimensional Space of Statistical Attributes.

Abstract

This thesis proposes a novel framework to robustly detect significant features by adaptively optimizing the integration of multiple feature scoring metrics. Significant feature detection is a critical process in many kinds of big-data applications. Its main purpose is to mine a complex dataset, which contains a large number of features, to detect features whose "behaviors" differ significantly between conditions. Such features can be genes, genomic methylation states, relationships between linguistic entities, and so on. For example, high-throughput technologies (e.g., microarrays and deep sequencing) have become pervasive in biological and biomedical investigations to simultaneously measure tens of thousands of genomic features (e.g., genes, RNA splicing, DNA methylation, and mutations). Accurate identification of significant features is essential for designing the follow-up experiments. In another scenario, a huge volume of unstructured messages pours from vast numbers of unsynchronized communication threads onto online social platforms, often too many for human users to follow. Automatic discovery of temporal dependency (i.e., one kind of significant feature) between messages can greatly facilitate communications.

A common way to detect significant features is to rank each feature by a score approximating its relevance to the index of interest. For example, in genome-wide data analysis, people are interested in detecting genomic features that are differentially expressed between two conditions, usually a target group and a control group. Performing inter/intra group statistical tests is a traditional approach to measuring how significantly each genomic feature is differentially expressed. Each type of statistical test has its own advantages in characterizing certain aspects of the differences between population means, and each often assumes a relatively simple data distribution (e.g., Gaussian, Poisson, or negative binomial) that the datasets of interest may not satisfy well. It is known that distributional assumptions the data do not satisfy well can lead to poor results when dealing with complex differential expression patterns. Therefore, it is critical to choose the appropriate statistical test, and more generally a feature scoring metric, that suits the underlying data distribution.

This thesis is composed of two major parts. The first part defines the mathematical model and briefly introduces the algorithm at a high level. To better explain the workflow, we emphasize the above differential expression problem in genome-wide data analysis, yet the framework is not limited to this application. The proposed framework aims to capture differential expression information more comprehensively by learning the optimized integration of multiple statistical attributes, each of which has relatively limited capacity to summarize the observed differential expression information. The problem is then framed as a learning problem: learn the optimal discriminant boundary in a multi-dimensional space of basic attributes, each of which can be a test statistic or another feature scoring metric. The learning problem is further formulated mathematically as a constrained optimization problem that aims to maximize discoveries under a user-defined false discovery rate (FDR). FDR is the expected proportion of type I errors (false positives) among the discoveries when conducting multiple comparisons. FDR control is a widely used approach to deciding the cutoff point that separates significance from non-significance.

We developed an effective algorithm named "Discriminant-Cut" to solve an instantiation of this problem. Extensive comparisons of Discriminant-Cut with other cutting-edge methods were carried out to demonstrate its robustness and effectiveness. The results showed that combining multiple basic attributes is significantly advantageous for detecting differentially expressed genomic features in genome-wide data analysis. Both synthesized datasets and real-world datasets are used in the comparisons.

In the second part we extend the framework to another application: automatic inference of conversation structures in online text messages. We plan to analyze short text conversations and frame the problem as a significant feature selection problem, in which each feature is a connection between two randomly chosen messages.

This thesis also presents key implementation details that affect the performance of Discriminant-Cut. For example, we incorporated several heuristics in the implementation of the algorithm to greatly improve its efficiency, which allows the algorithm to run fast in practice. We plan to add a parallel computing capability to the framework so that it can be deployed on large clusters. In addition, we will develop a "Discriminant-Cut analysis suite" that provides user-friendly GUIs so that users can not only analyze their datasets without complex operations and parameter settings, but also customize their own feature scoring metrics.
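
The abstract does not say how the FDR cutoff is computed, so the following is only a generic illustration of FDR control, not the Discriminant-Cut algorithm. It sketches the standard Benjamini-Hochberg step-up procedure applied to a single list of per-feature p-values; the function name `benjamini_hochberg`, the parameter `fdr`, and the sample p-values are all made up for this example.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the sorted indices of features called significant at the given FDR.

    Illustrative Benjamini-Hochberg step-up procedure on one score per feature.
    This is NOT the thesis's Discriminant-Cut, which combines several attributes.
    """
    m = len(p_values)
    # Rank features from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every feature ranked at or before the cutoff is declared significant.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten features (made-up numbers).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, fdr=0.05))
```

With these made-up values only the first two features pass the cutoff. A single-score procedure like this ranks features along one axis, whereas the framework described in the abstract looks for a boundary in a space of several such scores.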

Bibliographic record

  • Author

    Bei, Yuanzhe

  • Author affiliation

    Brandeis University

  • Degree-granting institution: Brandeis University
  • Subjects: Computer science; Biochemistry; Bioinformatics; Genetics; Mathematics
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 321 p.
  • Total pages: 321
  • Original format: PDF
  • Language: English