Journal: Computational Statistics & Data Analysis

Testing for outliers from a mixture distribution when some data are missing

Abstract

This paper considers multivariate outlier testing when a training sample from the population is available: a new observation is obtained and tested to determine whether it comes from the same population as the training sample. Problems of this type arise in a number of applications, including nuclear monitoring, biometrics (including fingerprint and handwriting identification), and medical diagnosis. In many cases it is reasonable to model the population of the training sample with a mixture-of-normals model (e.g. when the observations come from a variety of sources or the data are substantially non-normal). A modified likelihood ratio test is considered that applies when (a) the training data follow a mixture-of-normals distribution, (b) all labels in the training sample are missing, (c) some of the observation vectors in the training sample have missing information, and (d) the number of components in the mixture is unknown. When some data vectors have missing observations, the approach often used in practice is to base the test only on the vectors with complete data. When large amounts of data are missing, this strategy may lose valuable information, especially when the training sample is small, as is often the case in the nuclear monitoring setting. This paper discusses an alternative procedure that uses the expectation-maximization (EM) algorithm to handle the missing data and thereby incorporates all n data vectors. Simulations and examples compare applying the EM algorithm to the entire data set with using only the complete data vectors.
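
The contrast between using only complete vectors and using all n vectors via EM can be illustrated with a deliberately simplified sketch. It is not the paper's procedure: it assumes a single bivariate normal component rather than a mixture with an unknown number of components, assumes only the second coordinate can be missing, and uses a squared Mahalanobis distance as a stand-in outlier score instead of the modified likelihood ratio test. All function names, the simulated data, and the missingness pattern are hypothetical.

```python
import random

def _moments(rows):
    """rows hold (x, E[y], E[y^2]); return (mu_x, mu_y, s_xx, s_xy, s_yy)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum(r[0] ** 2 for r in rows) / n - mx ** 2
    sxy = sum(r[0] * r[1] for r in rows) / n - mx * my
    syy = sum(r[2] for r in rows) / n - my ** 2
    return mx, my, sxx, sxy, syy

def complete_case_estimate(data):
    """Maximum-likelihood moments using only rows with both coordinates."""
    return _moments([(x, y, y * y) for x, y in data if y is not None])

def em_estimate(data, iterations=200):
    """EM for one bivariate normal when y may be missing (None)."""
    mx, my, sxx, sxy, syy = complete_case_estimate(data)  # starting values
    for _ in range(iterations):
        # E-step: conditional mean and second moment of each missing y given x.
        beta = sxy / sxx
        resid_var = syy - sxy * beta
        rows = []
        for x, y in data:
            if y is None:
                ey = my + beta * (x - mx)
                rows.append((x, ey, ey * ey + resid_var))
            else:
                rows.append((x, y, y * y))
        # M-step: update the parameters from the expected sufficient statistics.
        mx, my, sxx, sxy, syy = _moments(rows)
    return mx, my, sxx, sxy, syy

def mahalanobis_sq(point, params):
    """Squared Mahalanobis distance of a 2-D point from the fitted normal."""
    mx, my, sxx, sxy, syy = params
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical training sample: y is dropped from every other row,
# so the missingness does not depend on the data values.
random.seed(1)
training = []
for i in range(40):
    x = random.gauss(0.0, 1.0)
    y = 0.8 * x + random.gauss(0.0, 0.6)
    training.append((x, y if i % 2 == 0 else None))

cc = complete_case_estimate(training)   # uses 20 of the 40 vectors
em = em_estimate(training)              # uses all 40 vectors

new_obs = (2.5, -2.5)                   # hypothetical new observation
print("complete-case score:", mahalanobis_sq(new_obs, cc))
print("EM score:           ", mahalanobis_sq(new_obs, em))
```

In this simplified setting the EM estimates of the mean and variance of the first coordinate draw on every row, while the complete-case estimates discard half of them; turning the score into a test would still require a reference distribution, which the sketch does not attempt.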