
Integrated feature subset selection/extraction with applications in bioinformatics.


Abstract

Feature subset selection and extraction algorithms are actively and extensively studied in the machine learning literature as a way to reduce the dimensionality of the feature space, since many machine learning and pattern recognition algorithms do not handle high-dimensional data sets efficiently or effectively. In the analysis of large-scale bioinformatics data sets, such as microarray gene expression data sets, the high dimensionality of the feature space, compounded by the low dimensionality of the sample space, creates even more problems for data analysis algorithms.

Two foremost characteristics of microarray gene expression data sets are (1) the correlation between features (genes) and (2) the availability of domain knowledge in computable format. In this dissertation, we study effective feature selection and extraction algorithms with applications to the analysis of newly emerging data sets in the bioinformatics domain. Microarray gene expression data, the result of large-scale RNA profiling techniques, are our primary focus. Several novel feature (gene) selection and extraction algorithms are proposed to deal with the peculiarities of microarray gene expression data sets.

To address the first characteristic, we first propose a general feature selection algorithm called Boost Feature Subset Selection (BFSS), based on permutation analysis, to broaden the scope of the selected gene set and thus improve classification performance. In BFSS, subsequent features to be selected focus on those samples where previously selected features fail. Our experiments showed the benefit of BFSS for t-score and S2N (signal-to-noise) based single-gene scores on a variety of publicly available microarray gene expression data sets.

We then examine the correlations among features (genes) explicitly to see whether such correlations are informative for sample classification. This results in our gene extraction algorithm, called virtual gene. A virtual gene is a group of genes whose expression levels are combined linearly. The combined expression levels of a virtual gene, instead of the real gene expression levels, are used for sample classification. Our experiments confirm that by taking into consideration the correlations between gene pairs, we can indeed build a better sample classifier.

A microarray gene expression data set represents only one aspect of our knowledge of the underlying biological system. A large amount of biological knowledge is currently available in computable format on the Internet. To address the second characteristic of microarray gene expression data sets, we investigate the integration of domain knowledge, such as that embedded in Gene Ontology annotations, into gene selection and extraction. (Abstract shortened by UMI.)
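The single-gene scores and the linear combination named in the abstract can be illustrated with a small sketch. It is not the dissertation's implementation. It assumes the commonly used two-class definitions of the S2N score, (mean1 - mean0) / (sd1 + sd0), and of the Welch t-statistic. The function names, the toy expression values, and the weights +1 and -1 are hypothetical; the abstract does not say how a virtual gene's weights are chosen.

```python
from statistics import mean, stdev


def split_by_class(values, labels):
    """Split one gene's expression values into class-1 and class-0 lists."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg


def s2n_score(values, labels):
    """Signal-to-noise score (assumed form): (mean1 - mean0) / (sd1 + sd0)."""
    pos, neg = split_by_class(values, labels)
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def t_score(values, labels):
    """Welch t-statistic between the two classes (assumed form)."""
    pos, neg = split_by_class(values, labels)
    se = (stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg)) ** 0.5
    return (mean(pos) - mean(neg)) / se


def virtual_gene(expr_a, expr_b, w_a, w_b):
    """Linearly combine two genes' expression profiles, sample by sample."""
    return [w_a * a + w_b * b for a, b in zip(expr_a, expr_b)]


# Hypothetical toy data: six samples, three per class.
labels = [1, 1, 1, 0, 0, 0]
gene_a = [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]
gene_b = [0.1, 0.9, 2.0, 1.4, 2.6, 3.5]

# The two genes are strongly correlated, so their difference removes the
# shared variation and leaves the class signal. Weights are illustrative only.
virtual = virtual_gene(gene_a, gene_b, 1.0, -1.0)

for name, profile in [("gene_a", gene_a), ("gene_b", gene_b), ("virtual", virtual)]:
    print(name, round(s2n_score(profile, labels), 3), round(t_score(profile, labels), 3))
```

On this toy data, neither gene separates the classes well on its own, while the combined profile scores far higher in absolute value. This is the kind of pairwise-correlation effect the abstract describes, shown here under the stated assumptions only.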

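Bibliographic record

  • Author: Xu, Xian
  • Author affiliation: State University of New York at Buffalo
  • Degree-granting institution: State University of New York at Buffalo
  • Subject: Computer Science; Biology, Bioinformatics
  • Degree: Ph.D.
  • Year: 2006
  • Pagination: 209 p.
  • Total pages: 209
  • Original format: PDF
  • Language of text: English (eng)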
