International Workshop on Privacy, Security, and Trust in KDD

Privacy-Preserving Sharing of Horizontally-Distributed Private Data for Constructing Accurate Classifiers

Abstract

Data mining tasks such as supervised classification can often benefit from a large training dataset. However, in many application domains, privacy concerns can hinder the construction of an accurate classifier by combining datasets from multiple sites. In this work, we propose a novel privacy-preserving distributed data sanitization algorithm that randomizes the private data at each site independently before the data is pooled to form a classifier at a centralized site. Distance-preserving perturbation approaches have been proposed by other researchers but we show that they can be susceptible to security risks. To enhance security, we require a unique non-distance-preserving approach. We use Kernel Density Estimation (KDE) Resampling, where samples are drawn independently from a distribution that is approximately equal to the original data's distribution. KDE Resampling provides consistent density estimates with randomized samples that are asymptotically independent of the original samples. This ensures high accuracy, especially when a large number of samples is available, with low privacy loss. We evaluated our approach on five standard datasets in a distributed setting using three different classifiers. The classification errors only deteriorated by 3% (in the worst case) when we used the randomized data instead of the original private data. With a large number of samples, KDE Resampling effectively preserves privacy (due to the asymptotic independence property) and also maintains the necessary data integrity for constructing accurate classifiers (due to consistency).
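
The resampling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the Scott-style rule-of-thumb bandwidth, the per-class resampling, and the function names `kde_resample` and `sanitize_site` are all hypothetical choices that the abstract does not specify. Drawing from a Gaussian KDE amounts to picking an original point at random and adding kernel-shaped noise to it.

```python
import numpy as np

def kde_resample(X, n_samples, bandwidth=None, rng=None):
    """Draw n_samples points from a Gaussian KDE fitted to the rows of X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if bandwidth is None:
        # Assumption: Scott-style rule of thumb, one bandwidth per dimension.
        bandwidth = X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))
    # Sampling from the KDE: choose a kernel centre uniformly at random,
    # then perturb it with Gaussian noise scaled by the bandwidth.
    centres = X[rng.integers(0, n, size=n_samples)]
    noise = rng.normal(size=(n_samples, d)) * bandwidth
    return centres + noise

def sanitize_site(X, y, rng=None):
    """Resample each class separately so labels stay attached to synthetic points.

    Assumption: every class at the site has at least two samples.
    """
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    parts_X, parts_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        parts_X.append(kde_resample(Xc, len(Xc), rng=rng))
        parts_y.append(np.full(len(Xc), label))
    return np.vstack(parts_X), np.concatenate(parts_y)

if __name__ == "__main__":
    # Two hypothetical sites holding horizontally partitioned data
    # (same features, different records).
    rng = np.random.default_rng(0)
    sites = []
    for _ in range(2):
        X = rng.normal(size=(200, 3))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        sites.append(sanitize_site(X, y, rng=rng))
    # Only the randomized samples are pooled at the central site.
    X_pool = np.vstack([Xs for Xs, _ in sites])
    y_pool = np.concatenate([ys for _, ys in sites])
    print(X_pool.shape, y_pool.shape)
```

Each site runs `sanitize_site` locally and shares only its output, so the central site trains on synthetic points rather than on any original record. How closely the pooled data tracks the original distribution depends on the bandwidth and the sample size, which is why the rule of thumb above should be read as a placeholder.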