Home > Foreign-language journals > Journal of Biomedical Semantics > TopFed: TCGA tailored federated query processing and linking to LOD

TopFed: TCGA tailored federated query processing and linking to LOD


Abstract

Background: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such a large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical covariates necessary for analysis.

Methods: We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. With the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these endpoints by proposing a TCGA tailored federated SPARQL query processing engine named TopFed.

Results: We compare TopFed with a well-established federation engine, FedX, in terms of source selection and query execution time, using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall), with a query execution time one third that of FedX.

Conclusion: With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA, as the amount and diversity of the data exceed the ability of local resources to handle its retrieval and parsing.
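
The abstract does not say how TopFed decides which endpoints to contact, so the following is only a generic sketch of index-based source selection in a federated SPARQL setting, not TopFed's or FedX's actual algorithm. It assumes a hypothetical predicate index that maps each endpoint to the predicates it hosts; the endpoint URLs, the `ex:` predicate names, and the functions `select_sources` and `plan` are all invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object).
# Terms that start with "?" are variables.
TriplePattern = Tuple[str, str, str]

# Hypothetical index: endpoint URL -> predicates stored at that endpoint.
# None of these URLs or predicate names come from the TopFed paper.
PREDICATE_INDEX: Dict[str, Set[str]] = {
    "http://example.org/sparql/clinical": {"ex:patient_barcode", "ex:gender"},
    "http://example.org/sparql/methylation": {"ex:beta_value", "ex:chromosome"},
    "http://example.org/sparql/expression": {"ex:rpkm", "ex:chromosome"},
}


def select_sources(pattern: TriplePattern,
                   index: Dict[str, Set[str]]) -> List[str]:
    """Return the endpoints that may hold answers for one triple pattern."""
    _, predicate, _ = pattern
    if predicate.startswith("?"):
        # An unbound predicate cannot be pruned by this index:
        # every endpoint stays relevant.
        return sorted(index)
    # Keep only the endpoints whose index entry lists the predicate.
    return sorted(ep for ep, preds in index.items() if predicate in preds)


def plan(patterns: List[TriplePattern],
         index: Dict[str, Set[str]]) -> Dict[TriplePattern, List[str]]:
    """Map every triple pattern of a query to its candidate endpoints."""
    return {p: select_sources(p, index) for p in patterns}


if __name__ == "__main__":
    query = [
        ("?patient", "ex:gender", "?g"),
        ("?result", "ex:beta_value", "?beta"),
        ("?result", "ex:chromosome", "?chr"),
    ]
    for pattern, endpoints in plan(query, PREDICATE_INDEX).items():
        print(pattern, "->", endpoints)
```

Pruning endpoints before any sub-query is sent is what keeps the number of selected sources low. Recall stays at 100% only if the index is complete, that is, if every predicate an endpoint actually stores is listed for it.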
