Journal: Database

Predicting structured metadata from unstructured metadata

Abstract

Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is an accurate, structured and complete description of the data, or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata in order to improve the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained on term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled with a Latent Dirichlet Allocation (LDA) model to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms are predicted most accurately by the TF-IDF approach, followed by LDA, with both outperforming the majority vote baseline. While some accuracy is lost through the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. Database URL: http://www.yeastgenome.org/
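
The comparison described in the abstract, a TF-IDF-based classifier against a majority vote baseline, can be illustrated with a minimal standard-library sketch. Everything below is an assumption made for illustration: the toy records, the "organism" element, and the nearest-centroid classifier are hypothetical stand-ins, since the abstract does not say which classifiers or which metadata elements the authors used. The LDA variant is not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records (not GEO data): free-text description -> structured "organism" term.
TRAIN = [
    ("total rna from mouse liver tissue", "Mus musculus"),
    ("mouse embryonic fibroblast cells treated", "Mus musculus"),
    ("mouse brain cortex sample", "Mus musculus"),
    ("human breast tumor biopsy", "Homo sapiens"),
    ("human blood sample from patient", "Homo sapiens"),
]
TEST = [
    ("liver tissue from mouse strain", "Mus musculus"),
    ("tumor biopsy from human patient", "Homo sapiens"),
]

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(text, idf):
    """L2-normalised sparse TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def fit_centroids(records, idf):
    """One unit-length centroid per label (an assumed, simple classifier)."""
    sums = defaultdict(lambda: defaultdict(float))
    for text, label in records:
        for tok, v in tfidf(text, idf).items():
            sums[label][tok] += v
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        centroids[label] = {tok: v / norm for tok, v in vec.items()}
    return centroids

def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(text, idf)
    def score(label):
        return sum(v * centroids[label].get(tok, 0.0) for tok, v in vec.items())
    return max(sorted(centroids), key=score)

def majority_label(records):
    """Majority vote baseline: always predict the most frequent training label."""
    return Counter(label for _, label in records).most_common(1)[0][0]

def accuracy(predictions, records):
    hits = sum(p == label for p, (_, label) in zip(predictions, records))
    return hits / len(records)

idf = fit_idf([text for text, _ in TRAIN])
centroids = fit_centroids(TRAIN, idf)
tfidf_acc = accuracy([predict(t, idf, centroids) for t, _ in TEST], TEST)
baseline_acc = accuracy([majority_label(TRAIN)] * len(TEST), TEST)
print(f"TF-IDF nearest-centroid accuracy: {tfidf_acc:.2f}")
print(f"Majority baseline accuracy: {baseline_acc:.2f}")
```

On realistic data the feature extraction and classifier would more likely come from an established machine learning library; the point here is only the shape of the comparison: the same held-out records are scored once with text-derived features and once with a baseline that ignores the text.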
