Class discovery via feature selection in unsupervised settings.

Abstract

Identifying genes linked to the appearance of certain types of cancers and their phenotypes is a well-known and challenging problem in bioinformatics. Discovering marker genes which, upon genetic mutation, drive the proliferation of different types and subtypes of cancer is critical for the development of advanced tests and therapies that will specifically identify, target, and treat certain cancers. Therefore, it is crucial to find methods that are successful in recovering "cancer-critical genes" from the (usually much larger) set of all genes in the human genome.

We approach this problem in the statistical context as a feature (or variable) selection problem for clustering, in the case where the number of important features is typically small (or rare) and the signal of each important feature is typically minimal (or weak). Genetic datasets typically consist of hundreds of samples (n), each with tens of thousands of gene-level measurements (p), resulting in the well-known statistical "large p small n" problem. The class or cluster identification is based on the clinical information associated with the type or subtype of the cancer (either known or unknown) for each individual. We discuss and develop novel feature ranking methods, which complement and build upon current methods in the field. These ranking methods are used to select the features that contain the most significant information for clustering. Retaining only a small set of useful features based on this ranking aids both in reducing data dimensionality and in identifying a set of genes that are crucial to understanding cancer subtypes.

In this paper, we present an outline of cutting-edge feature selection methods and provide a detailed explanation of our own contributions to the field. We explain both the practical properties and theoretical advantages of the new tools that we have developed. Additionally, we present a well-developed case study that applies these new feature selection methods to different levels of genetic data in order to examine their practical implementation within the field of bioinformatics.
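The abstract does not say which ranking statistics the dissertation develops, so the following is only an illustrative sketch of the general "rank features, keep a few, then cluster" workflow in a large p, small n setting. Everything in it is an assumption made for illustration: the simulated data, the function names, the use of sample variance as a placeholder ranking score, and the use of the sign of the leading singular vector as a simple two-class clustering rule. The signal is also deliberately strong here, unlike the rare-and-weak regime the abstract describes.

```python
import numpy as np


def simulate(n=100, p=2000, n_informative=20, shift=3.0, seed=0):
    """Simulate n samples with p features and two hidden, balanced classes.

    Only the first n_informative columns carry class signal (hypothetical setup).
    """
    rng = np.random.default_rng(seed)
    labels = np.arange(n) % 2                 # hidden class of each sample: 0, 1, 0, 1, ...
    X = rng.standard_normal((n, p))           # every feature starts as pure noise
    signs = 2 * labels - 1                    # map classes {0, 1} to {-1, +1}
    X[:, :n_informative] += shift * signs[:, None]  # shift informative features by class
    return X, labels


def rank_features(X):
    """Return feature indices ordered by sample variance, largest first.

    Variance is a placeholder score, not the ranking method of the dissertation.
    """
    scores = X.var(axis=0, ddof=1)
    return np.argsort(scores)[::-1]


def cluster_top_features(X, k):
    """Keep the k top-ranked features and split samples into two clusters.

    Clusters are assigned by the sign of the leading left singular vector
    of the column-centered, reduced data matrix.
    """
    keep = rank_features(X)[:k]
    Z = X[:, keep] - X[:, keep].mean(axis=0)
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return (U[:, 0] > 0).astype(int), keep


def agreement(pred, truth):
    """Fraction of matching labels, allowing for swapped cluster names."""
    acc = float(np.mean(pred == truth))
    return max(acc, 1.0 - acc)


if __name__ == "__main__":
    X, y = simulate()
    pred, keep = cluster_top_features(X, k=20)
    print("selected features:", np.sort(keep))
    print("agreement with hidden classes:", agreement(pred, y))
```

Because the hidden labels are used only to evaluate the result, the selection and clustering steps remain unsupervised. With weaker or rarer signals, a variance score would likely fail, which is the kind of setting that motivates more careful ranking methods.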

Bibliographic details

  • Author: Curtis, Jessica
  • Author affiliation: Boston University
  • Degree-granting institution: Boston University
  • Subject: Statistics; Mathematics
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 136 p.
  • Total pages: 136
  • Original format: PDF
  • Language: English
