Journal: Artificial Intelligence in Medicine

Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study

Abstract

Machine learning (ML) approaches have been widely applied to medical data to find reliable classifiers that improve diagnosis and detect candidate biomarkers of a disease. However, as a powerful, multivariate, data-driven approach, ML can be misled by biases and outliers in the training set and end up finding sample-dependent classification patterns. This often happens in biomedical applications, where outliers and biases are very common because data are scarce, heterogeneous, and acquired through complex processes. In this work we present a new workflow for ML-based biomedical research that maximizes the generalizability of the classification. The workflow relies on two data selection tools: an autoencoder, to identify outliers, and the Confounding Index, to understand which characteristics of the sample can mislead classification. As a case study we take the controversial research on extracting brain structural biomarkers of Autism Spectrum Disorders (ASD) from magnetic resonance images. A classifier trained on a dataset of 86 subjects, selected using this framework, obtained an area under the receiver operating characteristic curve of 0.79. The feature pattern identified by this classifier still captures the mean differences between the ASD and Typically Developing Control classes on 1460 new subjects in the same age range as the training set, thus providing new insights into the brain characteristics of ASD. We show that the proposed workflow makes it possible to find generalizable patterns even when the dataset is limited, whereas skipping the two steps above and using a larger but poorly designed training set would have produced a sample-dependent classifier.
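For readers less familiar with the metric quoted above, the area under the receiver operating characteristic curve (AUC) can be read as the probability that a randomly chosen positive subject receives a higher classifier score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `roc_auc` and the labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of classifier scores, higher = more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT taken from the study.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

In this toy example 8 of the 9 positive/negative pairs are ranked correctly. The quadratic loop is kept for readability; for real datasets a rank-based implementation from an established library would be the more usual choice.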