BMC Bioinformatics

Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data


Abstract

BACKGROUND: Surveys of the presence and absence of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize the similarity between the occurrences of two species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of the size of their intersection to the size of their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing with this similarity coefficient has seldom been used or studied.

RESULTS: We introduce a hypothesis test of similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of the expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden due to high dimensionality, we propose bootstrap and measurement concentration algorithms to efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly as dimensionality increases. We showcase their applications in evaluating co-occurrences of bird species in 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France.
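As a small illustration of the quantities named above, the following Python sketch computes the Jaccard/Tanimoto coefficient of two presence-absence vectors and a centered version that subtracts an expectation under independence. The abstract does not state the expectation formula; the expression p_x p_y / (p_x + p_y - p_x p_y) used here is an assumption (the ratio of the expected intersection to the expected union when the two species occur independently), and the function names are hypothetical rather than taken from the jaccard R package.

```python
def jaccard(x, y):
    """Jaccard/Tanimoto coefficient: |intersection| / |union| of two 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Convention chosen for this sketch: return 0.0 when neither species occurs.
    return both / either if either else 0.0


def expected_jaccard(px, py):
    """Assumed expectation under independence, given occurrence probabilities."""
    denom = px + py - px * py
    return (px * py) / denom if denom else 0.0


def centered_jaccard(x, y):
    """Observed coefficient minus the assumed expectation under independence."""
    px = sum(1 for a in x if a) / len(x)  # empirical occurrence probability of x
    py = sum(1 for b in y if b) / len(y)  # empirical occurrence probability of y
    return jaccard(x, y) - expected_jaccard(px, py)


# Two hypothetical species surveyed across eight sites.
species_a = [1, 1, 0, 0, 1, 0, 0, 0]
species_b = [1, 0, 0, 0, 1, 1, 0, 0]
print(jaccard(species_a, species_b))           # 2 shared sites / 4 occupied sites
print(centered_jaccard(species_a, species_b))  # positive: more overlap than expected
```

A centered value near zero suggests the overlap is about what the occurrence probabilities alone would produce, which is why centering matters when species differ widely in prevalence.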
The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).

CONCLUSION: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enables straightforward incorporation of probabilistic measures in the analysis of species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
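To give a rough sense of how resampling can attach a p-value to a Jaccard/Tanimoto coefficient, here is a minimal Python sketch of a one-sided permutation test. This is a swapped-in, simpler technique and not the bootstrap or measurement concentration algorithm of the paper, whose details the abstract does not give; the function name, the default of 2000 resamples, and the fixed seed are all illustrative assumptions.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two equal-length 0/1 vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def permutation_pvalue(x, y, n_resamples=2000, seed=0):
    """One-sided Monte Carlo p-value for excess co-occurrence.

    Shuffling y across sites breaks any association with x while keeping
    the number of sites each species occupies fixed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = jaccard(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        if jaccard(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical example: two species that occupy exactly the same 10 of 40 sites.
species_a = [1] * 10 + [0] * 30
species_b = list(species_a)
print(permutation_pvalue(species_a, species_b))
```

The cost of this approach grows with both the number of sites and the number of resamples, which hints at why faster estimation methods become attractive for high-dimensional data.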
