Testing for outliers from a mixture distribution when some data are missing

Wayne A. Woodward; Stephan R. Sain

Abstract

The problem being considered is that of multivariate outlier testing from a population from which a training sample is available. A new observation is obtained and is tested to determine whether it is from the population of the training sample. Problems of this type arise in a number of applications including nuclear monitoring, biometrics (including fingerprint and handwriting identification), and medical diagnosis. In many cases it is reasonable to model the population of the training sample using a mixture-of-normals model (e.g. when the observations come from a variety of sources or the data are substantially non-normal). A modified likelihood ratio test is considered that is applicable to the case in which: (a) the training data follow a mixture-of-normals distribution, (b) all labels in the training sample are missing, (c) some of the observation vectors in the training sample have missing information, and (d) the number of components in the mixture is unknown. The approach often used in practice when some of the data vectors have missing observations is to perform the test based only on the data vectors with full data. When large amounts of data are missing, use of this strategy may lead to loss of valuable information, especially in the case of small training samples which, for example, is often the case in the nuclear monitoring setting. This paper discusses an alternative procedure that incorporates all n of the data vectors using the expectation-maximization (EM) algorithm to handle the missing data. Simulations and examples are used to compare the use of the EM algorithm on the entire data set with the use of only the complete data vectors.
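The contrast drawn in the abstract, between using every data vector via EM and using only the complete vectors, can be illustrated with a much-simplified sketch. It assumes a single multivariate normal rather than a mixture with an unknown number of components, assumes entries are missing at random and coded as NaN, and scores a new observation by its squared Mahalanobis distance rather than by the paper's modified likelihood ratio test. The function names `em_mvn_missing` and `mahalanobis_sq` and the toy data are hypothetical, chosen only for this illustration.

```python
import numpy as np


def em_mvn_missing(X, n_iter=200, tol=1e-8):
    """EM estimates of the mean and covariance of one multivariate normal
    when some entries of X are NaN (assumed missing at random)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Start from the per-column mean and variance of the observed entries.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        # E-step: fill each missing block with its conditional mean and
        # accumulate the conditional covariance of the missing block.
        X_hat = np.where(miss, 0.0, X)
        C = np.zeros((p, p))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # nothing observed in this row
                X_hat[i] = mu
                C += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            B = np.linalg.solve(S_oo, S_mo.T).T  # equals S_mo @ inv(S_oo)
            X_hat[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ S_mo.T
        # M-step: maximum-likelihood updates from the completed statistics.
        mu_new = X_hat.mean(axis=0)
        D = X_hat - mu_new
        sigma_new = (D.T @ D + C) / n
        done = (np.max(np.abs(mu_new - mu)) < tol
                and np.max(np.abs(sigma_new - sigma)) < tol)
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma


def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance of x from a normal with (mu, sigma)."""
    d = np.asarray(x, dtype=float) - mu
    return float(d @ np.linalg.solve(sigma, d))


# Toy comparison on simulated data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
true_mu = np.array([0.0, 1.0, -1.0])
true_sigma = np.array([[1.0, 0.6, 0.3],
                       [0.6, 1.0, 0.5],
                       [0.3, 0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=40)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan  # delete about 20% of the entries

# Strategy 1: EM on all n rows, including the partially observed ones.
mu_em, sigma_em = em_mvn_missing(X_miss)

# Strategy 2: keep only the rows with no missing entries.
complete = X_miss[~np.isnan(X_miss).any(axis=1)]
mu_cc = complete.mean(axis=0)
sigma_cc = np.cov(complete.T, bias=True)

x_new = np.array([3.0, -2.0, 2.0])  # candidate outlier
print("rows used by complete-case:", len(complete), "of", len(X_miss))
print("distance with EM estimates:", mahalanobis_sq(x_new, mu_em, sigma_em))
print("distance with complete-case estimates:", mahalanobis_sq(x_new, mu_cc, sigma_cc))
```

In this sketch the complete-case estimates discard every row with at least one missing entry, while the EM estimates still draw on the observed coordinates of those rows. That is the information loss the abstract describes for small training samples.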


Bibliographic details

Source
Computational statistics & data analysis, 2003, No. 2, 18 pages
Authors
Wayne A. Woodward; Stephan R. Sain
Original format: PDF
Text language: English
Chinese Library Classification: Computing technology, computer technology
Keywords
EM algorithm; mixture model; missing data; outlier detection

