
Robust Significant Feature Detection by Learning Discriminant Boundary in Multi-dimensional Space of Statistical Attributes.


Abstract

This thesis proposes a novel framework that robustly detects significant features by adaptively optimizing the integration of multiple feature scoring metrics. Significant feature detection is a critical step in many big-data applications. Its main purpose is to mine a complex dataset containing a large number of features and detect those whose "behaviors" differ significantly between conditions. Such features can be genes, genomic methylation states, relationships between linguistic entities, and so on. For example, high-throughput technologies (e.g., microarrays and deep sequencing) have become pervasive in biological and biomedical investigations, where they simultaneously measure tens of thousands of genomic features (e.g., genes, RNA splicing, DNA methylation, and mutations). Accurate identification of significant features is essential for designing follow-up experiments. In another scenario, vast numbers of unsynchronized communication threads pour a huge volume of unstructured messages onto online social platforms, often more than human users can follow. Automatic discovery of temporal dependency between messages (one kind of significant feature) can greatly facilitate communication.

A common way to detect significant features is to rank each feature by a score approximating its relevance to the index of interest. For example, in genome-wide data analysis, researchers want to detect genomic features that are differentially expressed between two conditions, usually a target group and a control group. Performing inter-group and intra-group statistical tests is a traditional way to measure how significantly each genomic feature is differentially expressed.
Each type of statistical test has its own advantages in characterizing certain aspects of the differences between population means, and each often assumes a relatively simple data distribution (e.g., Gaussian, Poisson, or negative binomial) that the datasets of interest may not satisfy. Poorly met distributional assumptions are known to produce poor results on complex differential expression patterns. It is therefore critical to choose a statistical test, or more generally a feature scoring metric, that suits the underlying data distribution.

This thesis is composed of two major parts. The first part defines the mathematical model and introduces the algorithm at a high level. To better explain the workflow, we focus on the differential expression problem in genome-wide data analysis described above, although the framework is not limited to this application. The proposed framework aims to capture differential expression information more comprehensively by learning an optimized integration of multiple statistical attributes, each of which alone has limited capacity to summarize the observed differential expression information. The problem is framed as a learning problem: learn an optimal discriminant boundary in a multi-dimensional space of basic attributes, each of which can be a test statistic or another feature scoring metric. This learning problem is then formulated mathematically as a constrained optimization problem that maximizes the number of discoveries under a user-defined false discovery rate (FDR). The FDR is the expected proportion of type I errors (false positives) among the features declared significant when conducting multiple comparisons. FDR control is widely used to decide the cutoff point that separates significant from non-significant features.

We developed an effective algorithm named "Discriminant-Cut" to solve an instantiation of this problem.
Extensive comparisons of Discriminant-Cut with other cutting-edge methods were carried out to demonstrate its robustness and effectiveness. The results showed that combining multiple basic attributes is significantly advantageous for detecting differentially expressed genomic features in genome-wide data analysis. Both synthetic and real-world datasets are used in the comparisons.

In the second part, we will extend the framework to another application: automatic inference of conversation structures in online text messages. We plan to analyze short text conversations and frame the problem as a significant feature selection problem in which each feature is a connection between two randomly chosen messages.

This thesis will also present key implementation details that affect the performance of Discriminant-Cut. For example, we incorporated several heuristics into the implementation that greatly improve its efficiency and allow the algorithm to run fast in practice. We plan to add a parallel computing capability to the framework so that it can be deployed on large clusters. In addition, we will develop a "Discriminant-Cut analysis suite" that provides user-friendly GUIs so that users can analyze their datasets without complex operations and parameter settings and can also customize their own feature scoring metrics.
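
The abstract describes choosing a significance cutoff under a user-defined FDR. As a minimal illustration of FDR-controlled cutoff selection for a single scoring metric, the sketch below implements the standard Benjamini-Hochberg step-up procedure on a list of p-values. It is not the thesis's Discriminant-Cut algorithm, whose details are not given in this record. The function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag each p-value as significant under the Benjamini-Hochberg rule.

    alpha is the user-defined target FDR (an illustrative default).
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Every feature ranked at or above the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for ten features (made up for this example).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example_p, alpha=0.05)
print(sum(flags), "features declared significant")
```

This procedure places a cutoff along one ranked score. The framework described in the abstract instead learns a boundary in a space of several such scores, subject to the same kind of FDR constraint.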


Bibliographic record

Author: Bei, Yuanzhe
Author affiliation: Brandeis University
Degree-granting institution: Brandeis University
Subjects: Computer science; Biochemistry; Bioinformatics; Genetics; Mathematics
Degree: Ph.D.
Year: 2016
Pages: 321 p.
Total pages: 321
Original format: PDF
Language of text: English

