Journal of Big Data

A non-parametric maximum for number of selected features: objective optima for FDR and significance threshold with application to ordinal survey analysis



Abstract

This paper identifies a criterion for choosing an optimum set of selected features, or rejected null hypotheses, in high-dimensional data analysis. The method is designed for dimension reduction with multiple hypothesis testing, as used in the filtering process of big data and in exploratory research, to identify significant associations between many predictor variables and a few outcomes. The novelty of the proposed method is that the selected p-value threshold is insensitive to dependency among features and between features and the outcome. The method neither requires a predetermined threshold for the level of significance nor uses a presumed threshold for the false discovery rate (FDR). The presented method chooses the optimum p-value for a powerful yet parsimonious model; for every set of rejected hypotheses, the researcher can then also report traditional measures of statistical accuracy, such as the expected number of false positives and the false discovery rate. The upper limit for the number of rejected hypotheses (or selected features) is determined by finding the maximum difference between expected true hypotheses and expected false hypotheses among all possible sets of rejected hypotheses. Methods for choosing an optimum number of selected features, such as piecewise regression, are then used to form a parsimonious model. The paper reports the results of implementing the proposed methods in a novel example of non-parametric analysis of high-dimensional ordinal survey data.
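The upper-limit criterion described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading, not the authors' implementation: the abstract gives no formulas, so the sketch assumes that rejecting the k smallest of m p-values yields about m * p_(k) expected false positives (capped at k) and treats the remaining rejections as expected true positives. The function name `max_rejections` and the returned field names are hypothetical.

```python
import numpy as np


def max_rejections(p_values):
    """Find the rejection count k that maximizes (expected true - expected false).

    Assumption (not stated in the abstract): with threshold p_(k), the k-th
    smallest of m p-values, the expected number of false positives is taken
    to be m * p_(k), capped at k. The rest of the k rejections are counted
    as expected true positives.
    """
    p = np.sort(np.asarray(p_values, dtype=float))
    m = p.size
    k = np.arange(1, m + 1)

    expected_false = np.minimum(m * p, k)   # cannot exceed the number rejected
    expected_true = k - expected_false
    diff = expected_true - expected_false

    best = int(np.argmax(diff))
    if diff[best] <= 0:
        # No rejection set has more expected true than expected false positives.
        return {"k": 0, "threshold": None, "expected_false": 0.0, "fdr": 0.0}

    return {
        "k": int(k[best]),                      # upper limit on selected features
        "threshold": float(p[best]),            # p-value cutoff at that limit
        "expected_false": float(expected_false[best]),
        "fdr": float(expected_false[best] / k[best]),
    }


if __name__ == "__main__":
    # Illustrative p-values only: five very small ones plus five large ones.
    demo = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.2, 0.4, 0.6, 0.8, 1.0]
    print(max_rejections(demo))
```

Under these assumptions the threshold is read off the data rather than fixed in advance, and the expected number of false positives and the FDR are reported for the chosen set, as the abstract describes. The later refinement toward a parsimonious model (for example, piecewise regression) is not sketched here.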
