Journal: Machine Learning
Inclusion of Textual Documentation in the Analysis of Multidimensional Data Sets: Application to Gene Expression Data

Abstract

Recently, biology has been confronted with large multidimensional gene expression data sets in which the expression of thousands of genes is measured over dozens of conditions. The patterns in gene expression are frequently explained retrospectively by underlying biological principles. Here we present a method that uses text analysis to help find meaningful gene expression patterns that correlate with the underlying biology described in the scientific literature. The main challenge is that the literature about an individual gene is not homogeneous and may address many unrelated aspects of the gene. In the first part of the paper we present and evaluate the neighbor divergence per gene (NDPG) method, which assigns a score to a given subgroup of genes indicating the likelihood that the genes share a biological property or function. To do this, it uses only a reference index that connects genes to documents and a corpus that includes those documents. In the second part of the paper we present an approach, optimizing separating projections (OSP), that searches for linear projections in gene expression data that separate functionally related groups of genes from the rest of the genes; the objective function in our search is the NDPG score of the positively projected genes. A successful search, therefore, should identify patterns in gene expression data that correlate with meaningful biology. We apply OSP to a published gene expression data set; it discovers many biologically relevant projections. Since the method requires only numerical measurements (in this case expression) about entities (genes) with textual documentation (literature), we conjecture that this method could be transferred easily to other domains. The method should be able to identify relevant patterns even if the documentation for each entity pertains to many disparate, unrelated subjects.
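
The OSP idea described in the abstract (project each gene's expression profile onto a weight vector, keep the positively projected genes, and score that group) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how NDPG is computed or how the search is carried out, so `score_group` is a hypothetical stand-in for an NDPG-like scorer, and plain random search is swapped in for whatever optimizer the paper uses. The function names are invented.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def positively_projected(
    expression: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[int]:
    """Return indices of genes whose projection onto `weights` is positive.

    `expression` holds one row per gene and one column per condition.
    """
    return [
        i
        for i, row in enumerate(expression)
        if sum(x * w for x, w in zip(row, weights)) > 0
    ]


def search_projection(
    expression: Sequence[Sequence[float]],
    score_group: Callable[[List[int]], float],
    n_trials: int = 200,
    seed: int = 0,
) -> Tuple[Optional[List[float]], float]:
    """Random search over weight vectors (an assumed stand-in optimizer).

    `score_group` is a hypothetical placeholder for an NDPG-like score of
    a gene group; higher is taken to mean "more functionally coherent".
    """
    rng = random.Random(seed)
    n_conditions = len(expression[0])
    best_weights: Optional[List[float]] = None
    best_score = float("-inf")
    for _ in range(n_trials):
        # Draw a random direction in condition space.
        weights = [rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
        group = positively_projected(expression, weights)
        if not group:
            continue  # an empty group cannot be scored
        score = score_group(group)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

In use, `score_group` would be replaced by a function that scores a gene group from its associated literature; the sketch only fixes the shape of the objective, namely the score of the positively projected genes as a function of the projection.
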
