Journal: Synthetic and Systems Biotechnology

The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer


Abstract

Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase in sequencing depth, hundreds of thousands of contigs are routinely assembled in metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT with $k$-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt $k$-mer statistics for HGT detection, we developed two aggregative statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, which subsample metagenome contigs by their representative regions and summarize the regional $D_2^S$ and $D_2^*$ metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different $k$-mer sizes using simulations. Our analysis showed that, in general, the power of $T_{sum}^{S}$ and $T_{sum}^{*}$ increases with sequencing coverage and reaches a maximum power of >80% at $k = 6$, with 5% Type-I error and a coverage ratio >0.2x. The statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
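The abstract names the regional $D_2^S$ and $D_2^*$ metrics without defining them. As a rough illustration only, the sketch below computes the commonly cited forms of these two statistics for a pair of DNA sequences. It assumes an i.i.d. background model with base frequencies estimated separately from each sequence; the function names (`kmer_counts`, `base_freqs`, `word_prob`, `d2_statistics`) are hypothetical and are not taken from the paper. The aggregation into $T_{sum}^{S}$ and $T_{sum}^{*}$ (region subsampling and upper bounds) is not reproduced here, because the abstract does not specify it.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


def base_freqs(seq):
    """Estimate single-base frequencies (assumed i.i.d. background model)."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in ALPHABET)
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: seq.count(b) / total for b in ALPHABET}


def word_prob(word, freqs):
    """Probability of a word under the i.i.d. model: product of base frequencies."""
    p = 1.0
    for b in word:
        p *= freqs[b]
    return p


def d2_statistics(x, y, k):
    """Return (D2S, D2*) for sequences x and y at word length k.

    Centered counts: Xc_w = X_w - n_x * p_w^x, and likewise for y.
    D2S = sum_w Xc_w * Yc_w / sqrt(Xc_w^2 + Yc_w^2)
    D2* = sum_w Xc_w * Yc_w / sqrt(n_x * p_w^x * n_y * p_w^y)
    Words whose denominator is zero are skipped.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nx, ny = sum(cx.values()), sum(cy.values())
    fx, fy = base_freqs(x), base_freqs(y)
    d2s = 0.0
    d2star = 0.0
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        px, py = word_prob(w, fx), word_prob(w, fy)
        xc = cx[w] - nx * px
        yc = cy[w] - ny * py
        denom_s = sqrt(xc * xc + yc * yc)
        if denom_s > 0:
            d2s += xc * yc / denom_s
        denom_star = sqrt(nx * px * ny * py)
        if denom_star > 0:
            d2star += xc * yc / denom_star
    return d2s, d2star
```

Calling `d2_statistics(contig, reference, 6)` would return the two similarity scores for one pair of regions at the $k = 6$ setting mentioned in the abstract. The loop visits all $4^k$ words, which is acceptable for small $k$ but would need a sparse formulation for larger word sizes.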
