Efficient, robust and effective rank aggregation for massive biological datasets

Abstract

Massive biological datasets are available from various sources. To answer a biological question (e.g., "which genes are involved in a given disease?"), life scientists query and mine such datasets using various techniques. Each technique provides a list of results, usually ranked by importance (e.g., a ranked list of genes). Combining the results obtained by various techniques, that is, combining ranked lists of elements into one list, is of paramount importance to help life scientists make the most of these results and prioritize further investigations. Rank aggregation techniques are particularly well suited to this context, as they take a set of rankings as input and provide a consensus, that is, a single ranking which is the "closest" to the input rankings. However, (i) the problem of rank aggregation is NP-hard in most cases (using an exact algorithm is currently not possible for more than a few dozen elements) and (ii) several (possibly very different) exact solutions can be obtained. In answer to (i), many heuristics and approximation algorithms have been proposed. However, heuristics cannot guarantee how far from an exact solution the consensus ranking will be, and the approximation ratio of approximation algorithms dedicated to the problem is fairly high (not less than 3/2). No solution has yet been proposed to help end-users deal with the problem described in point (ii). In this paper we present a complete system able to perform rank aggregation of massive biological datasets. Our solution is efficient, as it is based on an original partitioning method that makes it possible to compute a high-quality consensus using an exact algorithm in a large number of cases. Our solution is robust, as it clearly identifies for the user the groups of elements whose relative order is the same in any optimal solution. These features provide answers to points (i) and (ii) and rest on mathematical foundations offering guarantees on the computed result. Also, our solution is effective, as it has been implemented in a real tool, ConquR-BioV2, which is used by the life science community, and it has been evaluated at large scale using a very large number of datasets.
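To make points (i) and (ii) concrete, here is a minimal brute-force sketch of Kemeny-style rank aggregation on a toy input. It is not the partitioning method of the paper: it assumes strict rankings without ties over the same elements, uses the Kendall tau distance as the notion of "closest", and enumerates every permutation, which is only feasible for a handful of elements. The function names (`kendall_tau`, `optimal_consensuses`, `stable_pairs`) and the toy rankings are hypothetical and chosen for illustration only.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Count the element pairs that the two rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def optimal_consensuses(rankings):
    """Return (best score, all permutations reaching it) by exhaustive search.

    The score of a candidate is the sum of its Kendall tau distances
    to the input rankings. Cost grows factorially with the element count.
    """
    elements = sorted(rankings[0])
    best_score, best = None, []
    for perm in permutations(elements):
        score = sum(kendall_tau(perm, r) for r in rankings)
        if best_score is None or score < best_score:
            best_score, best = score, [perm]
        elif score == best_score:
            best.append(perm)
    return best_score, best


def stable_pairs(optima):
    """Pairs (a, b) such that a precedes b in every optimal consensus."""
    elements = sorted(optima[0])
    positions = [{e: i for i, e in enumerate(o)} for o in optima]
    stable = []
    for a, b in combinations(elements, 2):
        if all(p[a] < p[b] for p in positions):
            stable.append((a, b))
        elif all(p[b] < p[a] for p in positions):
            stable.append((b, a))
    return stable


# Toy input (made up): A, B and C form a cyclic majority, D is always last.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "C", "A", "D"],
    ["C", "A", "B", "D"],
]
score, optima = optimal_consensuses(rankings)
print(score)                 # minimal total distance
print(optima)                # every optimal consensus
print(stable_pairs(optima))  # orderings shared by all optimal consensuses
```

On this toy input three different consensus rankings tie for the minimal score, which illustrates point (ii), while the only orderings shared by all of them are those placing D after A, B and C. Reporting such shared orderings is the kind of information the abstract calls robust, although the paper obtains it without enumerating permutations.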

Bibliographic details

  • Source
    Future Generation Computer Systems | 2021, Issue 11 | pp. 406-421 | 16 pages
  • Author affiliations

    Universite Paris-Saclay CNRS Laboratoire Interdisciplinaire des Sciences du Numerique 91405 Orsay France;

    Hub de Bioinformatique et Biostatistique Departement Biologie Computationnelle Institut Pasteur USR 3756 CNRS Paris 75015 France;

    Universite Gustave Eiffel CNRS Laboratoire d'Informatique Gaspard-Monge Marne-la-Vallee France;

    Universite Paris-Saclay CNRS Laboratoire Interdisciplinaire des Sciences du Numerique 91405 Orsay France;

    Universite Paris-Saclay CNRS Laboratoire Interdisciplinaire des Sciences du Numerique 91405 Orsay France; Universite Paris-Saclay CEA CNRS Institut de Biologie Integrative de la Cellule (I2BC) 91198 Gif-sur-Yvette France;

    Universite Paris-Saclay CNRS Laboratoire Interdisciplinaire des Sciences du Numerique 91405 Orsay France;

    Universite Gustave Eiffel CNRS Laboratoire d'Informatique Gaspard-Monge Marne-la-Vallee France;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Rank aggregation; Consensus ranking; Massive biological datasets; Kemeny rule;
