Journal: Planta: An International Journal of Plant Biology

Performance comparison of gene family clustering methods with expert curated gene family data set in Arabidopsis thaliana


Abstract

With the exponential growth of genomics data, the demand for reliable clustering methods is increasing every day. Despite the wide usage of many clustering algorithms, the accuracy of these algorithms has been evaluated mostly on simulated data sets and seldom on real biological data for which a "correct answer" is available. In order to address this issue, we use the manually curated high-quality Arabidopsis thaliana gene family database as a "gold standard" to conduct a comprehensive comparison of the accuracies of four widely used clustering methods including K-means, TribeMCL, single-linkage clustering and complete-linkage clustering. We compare the results from running different clustering methods on two matrices: the E-value matrix and the k-tuple distance matrix. The E-value matrix is computed based on BLAST E-values. The k-tuple distance matrix is computed based on the difference in tuple frequencies. The TribeMCL with the E-value matrix performed best, with the Inflation parameter (=1.15) tuned considerably lower than what has been suggested previously (=2). The single-linkage clustering method with the E-value matrix was second best. Single-linkage clustering, K-means clustering, complete-linkage clustering, and TribeMCL with a k-tuple distance matrix performed reasonably well. Complete-linkage clustering with the k-tuple distance matrix performed the worst.
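The abstract describes a k-tuple distance built from differences in tuple frequencies, single-linkage clustering, and TribeMCL's Inflation parameter, but gives no formulas. The Python sketch below illustrates those three ideas under explicit assumptions. The distance is taken to be the sum of squared differences in relative k-tuple frequencies, which may differ from the paper's exact definition. Single-linkage clustering is applied at a fixed cutoff. Only the inflation step of Markov clustering is shown, not the full TribeMCL pipeline. All function names, the toy sequences, and the cutoff value are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ktuple_counts(seq, k=2):
    """Count every overlapping k-tuple in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a, b, k=2):
    """Assumed definition: sum of squared differences in relative k-tuple frequencies."""
    ca, cb = ktuple_counts(a, k), ktuple_counts(b, k)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("each sequence must contain at least k characters")
    return sum((ca[t] / na - cb[t] / nb) ** 2 for t in set(ca) | set(cb))


def single_linkage(names, dist, cutoff):
    """Single-linkage clusters at a fixed cutoff.

    Equivalent to the connected components of the graph that links
    every pair whose distance is <= cutoff (tracked with union-find).
    """
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for x, y in combinations(names, 2):
        if dist[(x, y)] <= cutoff:
            parent[find(x)] = find(y)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


def inflate(column, r):
    """MCL inflation step: raise each entry to the power r, then renormalise."""
    powered = [p ** r for p in column]
    total = sum(powered)
    return [p / total for p in powered]


# Toy input (hypothetical sequences, not data from the paper).
seqs = {"a": "ACACACAC", "b": "ACACACAT", "c": "GGTTGGTT"}
names = sorted(seqs)
dist = {(x, y): ktuple_distance(seqs[x], seqs[y])
        for x, y in combinations(names, 2)}
clusters = single_linkage(names, dist, cutoff=0.1)
```

With these toy sequences, `a` and `b` differ in a single tuple and fall into one cluster, while `c` stays separate. Calling `inflate([0.6, 0.4], 2)` sharpens the column more strongly than `inflate([0.6, 0.4], 1.15)`. This matches the general behaviour of Markov clustering, where a lower inflation value tends to produce coarser clusters.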
