Journal: BMC Bioinformatics

Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach



Abstract

Background: Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual's class membership to his or her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited by cost constraints, so it is necessary to determine the sample size needed to build an accurate SNP-based classifier. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not only guarantee an overall correct classification rate, but also make studies more cost-effective.

Results: For coded SNP data from D (≥2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are then used to evaluate the associated AUCs or VUSs, and their accuracy is validated using Monte Carlo simulations. We give a sample size determination method that ensures that the difference between the two approximate AUCs (or VUSs), one per classifier, is below a pre-specified threshold. The performance of this method is then illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs, and the required total sample sizes are determined for a continuum of threshold values. In all, four different sample size determination studies are conducted with the HapMap data, covering cases that range from well-separated populations to poorly separated ones.

Conclusion: For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this methodology will help scientists determine whether the sample at hand is adequate or whether more samples are required to achieve a pre-specified accuracy. A PDF manual for the R package "SampleSizeSNP" is given in Additional file 1, and a ZIP file of the R package "SampleSizeSNP" is given in Additional file 2.
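To make the two performance measures named in the abstract concrete, the following is a minimal Python sketch of their textbook empirical estimators for scalar scores. It is not the paper's normal-approximation method and is not taken from the "SampleSizeSNP" package. The function names `empirical_auc` and `empirical_vus` are hypothetical, and the three-class version assumes the classes have a natural ordering on a single score, which is a simplification.

```python
from itertools import product


def empirical_auc(scores_a, scores_b):
    """Two-class AUC: fraction of (a, b) pairs with a < b, ties counted as 1/2.

    scores_a are scores from the lower class, scores_b from the higher class.
    """
    total = 0.0
    for a, b in product(scores_a, scores_b):
        if a < b:
            total += 1.0
        elif a == b:
            total += 0.5
    return total / (len(scores_a) * len(scores_b))


def empirical_vus(scores_1, scores_2, scores_3):
    """Three-class VUS: fraction of triples (one score per class)
    that are strictly ordered as s1 < s2 < s3. Ties count as misordered.
    """
    hits = sum(
        1
        for a, b, c in product(scores_1, scores_2, scores_3)
        if a < b < c
    )
    return hits / (len(scores_1) * len(scores_2) * len(scores_3))


if __name__ == "__main__":
    # Toy, made-up scores purely for illustration.
    print(empirical_auc([1, 3], [2, 4]))
    print(empirical_vus([1, 2], [2, 3], [3, 4]))
```

Under this definition an uninformative score gives an AUC near 1/2 and a three-class VUS near 1/6, so VUS values should be read against a chance level that falls as the number of classes grows.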
