Journal: BMC Bioinformatics

ALE: automated label extraction from GEO metadata

Abstract

NCBI’s Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, the information about each experiment (metadata) takes the form of an open-ended, non-standardized textual description provided by the depositor. Thus, classifying experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without first assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the volume of data to be processed, but also because they ensure standardization and consistency. While some of these labels can be extracted directly from the textual metadata, much of the available data contains no explicit text stating the age and gender of the subjects in the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. Our analysis shows that only 26% of metadata text contains information about gender and 21% about age. To ameliorate the lack of available labels for these datasets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels based on the gene expression of the samples and compare this to the text-based method. Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach and machine learning. We show that the two methods together improve the accuracy of label assignment to GEO samples.
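
To make the text-based step concrete, the following is a minimal sketch of what heuristic label extraction from free-text metadata might look like. It is not the authors' implementation: the abstract does not describe the actual rules, so the regular expressions, the function name `extract_labels`, the `key: value` metadata layout, and the choice to treat an age without a unit as years are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical patterns; the heuristics used by ALE are not given in the abstract.
_GENDER_RE = re.compile(
    r"\b(?:sex|gender)\s*[:=]\s*(female|male|f|m)\b", re.IGNORECASE
)
_AGE_RE = re.compile(
    r"\bage\s*[:=]\s*(\d+(?:\.\d+)?)\s*"
    r"(years?|yrs?|y|months?|mo|weeks?|wks?|w|days?|d)?\b",
    re.IGNORECASE,
)

# Divisors that convert a value in the given unit to years,
# keyed by the first letter of the matched unit.
_UNITS_PER_YEAR = {"y": 1.0, "m": 12.0, "w": 365.25 / 7.0, "d": 365.25}


def extract_labels(text: str) -> Dict[str, Optional[Union[str, float]]]:
    """Return gender and age (in years) found in a metadata string, else None."""
    labels: Dict[str, Optional[Union[str, float]]] = {
        "gender": None,
        "age_years": None,
    }

    gender_match = _GENDER_RE.search(text)
    if gender_match:
        token = gender_match.group(1).lower()
        labels["gender"] = "female" if token.startswith("f") else "male"

    age_match = _AGE_RE.search(text)
    if age_match:
        value = float(age_match.group(1))
        # Assumption: an age with no unit is taken to be in years.
        unit = (age_match.group(2) or "years").lower()
        labels["age_years"] = value / _UNITS_PER_YEAR[unit[0]]

    return labels


if __name__ == "__main__":
    print(extract_labels("gender: Female; age: 45 years; tissue: liver"))
    print(extract_labels("tissue: brain; stage: 3"))
```

Samples for which such rules return `None` are the ones the abstract proposes to label from gene expression instead, using a model trained on samples where the text-derived label is available.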
