Moment-based density estimation of confidential micro-data: a computational statistics approach

Bradley Wakefield; Yan-Xia Lin; Rathin Sarathy; Krishnamurty Muralidhar

Abstract

Providing access to synthetic micro-data in place of confidential data is a common way to protect the privacy of participants. For the synthetic data to be useful for analysis, its density function must closely approximate that of the confidential data. Hence, accurately estimating the density function from sample micro-data is important. Existing kernel-based, copula-based, and machine learning methods of joint density estimation may not be viable. Applying the multivariate moment problem to sample-based density estimation has long been considered impractical because of its computational complexity and the intractability of optimal parameter selection when the true joint density function is unknown. This paper introduces a generalised form of the sample moment-based density estimate, which can be used to estimate joint density functions when only empirical moments are available. We demonstrate optimal parametrisation of the moment-based density estimate based solely on sample data by employing a computational strategy for parameter selection. We compare the performance of the moment-based estimate with that of existing non-parametric and parametric density estimation methods. The results show that empirical moments can provide a reasonable, robust non-parametric approximation of a joint density function that is comparable to existing non-parametric methods. We provide an example of synthetic data generation from the moment-based density estimate and show that the resulting synthetic data provide a reasonable disclosure-protected alternative for public release.
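To make the idea of estimating a density from empirical moments alone more concrete, the following is a minimal univariate sketch. It is not the generalised multivariate estimator or the parameter-selection strategy of the paper. It assumes a Legendre polynomial expansion on [-1, 1], data already rescaled to that interval, and a fixed truncation order; the function name `moment_density` and the Beta-distributed sample are hypothetical and used only for illustration.

```python
import numpy as np
from numpy.polynomial import legendre as L


def moment_density(sample_moments):
    """Return a polynomial density estimate on [-1, 1].

    sample_moments[j] is the empirical mean of x**j for j = 0..K.
    Assumption: a Legendre expansion f(x) ~ sum_k c_k P_k(x) with
    c_k = (2k + 1) / 2 * E[P_k(X)], truncated at order K.
    """
    m = np.asarray(sample_moments, dtype=float)
    K = len(m) - 1
    coefs = np.zeros(K + 1)
    for k in range(K + 1):
        # Monomial coefficients of P_k, so that P_k(x) = sum_j a[j] * x**j.
        a = L.leg2poly([0] * k + [1])
        # E[P_k(X)] is a linear combination of the first k + 1 moments.
        e_pk = np.dot(a, m[: k + 1])
        coefs[k] = (2 * k + 1) / 2 * e_pk
    # The estimate is evaluated as a Legendre series with these coefficients.
    return lambda x: L.legval(x, coefs)


# Hypothetical sample, rescaled from [0, 1] to [-1, 1].
rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=5000) * 2 - 1

# Only these summary statistics are passed to the estimator.
moments = np.array([np.mean(x**j) for j in range(7)])
f_hat = moment_density(moments)
```

Only the vector `moments` reaches the estimator, so the individual records are never needed once the moments are computed. A truncated expansion of this kind can dip below zero in low-density regions, and the choice of truncation order matters in practice, which is one reason parameter selection is treated as a central problem.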

