Statistics in Medicine

Clustering and variable selection in the presence of mixed variable types and missing data


Abstract

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (> 0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
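
The two modeling ingredients named in the abstract, a Dirichlet process mixture and a latent continuous treatment of discrete variables, can be illustrated with a minimal sketch. This is not the authors' implementation: it only simulates from a truncated stick-breaking prior and thresholds a latent Gaussian score into an ordinal category. The function names, the truncation level, the concentration parameter `alpha`, the cluster-mean prior, and the cutpoints are all assumptions chosen for illustration, and no inference, variable selection, or missing-data handling is shown.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha and truncation are illustrative choices, not values from the paper.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component takes the rest, so weights sum to 1
    return weights


def discretize(latent, cutpoints):
    """Map a latent continuous value to an ordinal category.

    The category is the number of (sorted) cutpoints lying below the latent value.
    """
    return sum(latent > c for c in cutpoints)


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=10, rng=rng)

# Hypothetical cluster means for a single latent score.
means = [rng.gauss(0.0, 3.0) for _ in weights]

# Simulate one observation: cluster label, latent score, observed category.
label = rng.choices(range(len(weights)), weights=weights)[0]
latent = rng.gauss(means[label], 1.0)
observed = discretize(latent, cutpoints=[-1.0, 0.0, 1.0])
```

In this sketch the number of occupied clusters is not fixed in advance: most of the weight typically falls on a few components, which is the sense in which a Dirichlet process mixture allows an unknown number of components.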
