Journal: Journal of Statistical Planning and Inference

Significance analysis of high-dimensional, low-sample size partially labeled data


Abstract

Classification and clustering are both important topics in statistical learning. A natural question is whether predefined classes are really different from one another, or whether clusters are really there. Specifically, we may be interested in knowing whether the two classes defined by some class labels (when they are provided), or the two clusters tagged by a clustering algorithm (when class labels are not provided), are from the same underlying distribution. Although both questions are challenging for high-dimensional, low-sample size data, there have been recent developments on each. However, when it is costly to label observations manually, often only a small portion of the class labels is available. In this article, we propose a significance analysis method for this type of data, namely partially labeled data. Our method makes use of the whole data set and tests the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method provides greater power while maintaining the size, as illustrated by a comprehensive simulation study. Theoretical properties of the proposed method are studied with emphasis on the high-dimensional, low-sample size setting. Our simulated examples help clarify when and how the information extracted from the labeled data can be effective. A real data example further illustrates the usefulness of the proposed method. © 2016 Elsevier B.V. All rights reserved.
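
For orientation only, the basic question posed in the abstract (are two labeled classes drawn from the same underlying distribution?) can be illustrated in the fully labeled case with a generic label-permutation test. The sketch below is not the method proposed in the article, which targets partially labeled data. The function names, the choice of the distance between class means as the test statistic, and the toy Gaussian data are all assumptions made for illustration.

```python
import math
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    d = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(d)]


def mean_distance(X, labels):
    """Euclidean distance between the class-0 and class-1 mean vectors."""
    a = mean_vector([x for x, lab in zip(X, labels) if lab == 0])
    b = mean_vector([x for x, lab in zip(X, labels) if lab == 1])
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def permutation_test(X, labels, n_perm=200, seed=0):
    """Hypothetical helper: permutation p-value for a class difference.

    Labels are shuffled n_perm times; the p-value is the share of shuffles
    whose statistic is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean_distance(X, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_distance(X, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + exceed) / (n_perm + 1)


# Toy high-dimensional, low-sample size data: 20 observations, 200 features.
data_rng = random.Random(1)
n, d = 20, 200
labels = [0] * (n // 2) + [1] * (n // 2)
X_null = [[data_rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
# Shift every coordinate of class 1 by one unit to create a real difference.
X_alt = [[v + 1.0 if lab == 1 else v for v in row]
         for row, lab in zip(X_null, labels)]

_, p_null = permutation_test(X_null, labels)
_, p_alt = permutation_test(X_alt, labels)
print(f"no class difference: p = {p_null:.3f}")
print(f"shifted class means: p = {p_alt:.3f}")
```

A test of this kind uses only labeled observations. The setting described in the abstract differs in that most labels are missing, and the proposed method draws on the unlabeled observations as well.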
