Conference: ACM KDD International Conference on Knowledge Discovery and Data Mining

The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing


Abstract

Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birth date. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each quasi-identifier tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization, which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as the accuracy of data-mining algorithms executed on the same sanitized records. For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
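
The syntactic conditions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The records, field names (`zip`, `birth`, `diagnosis`), and function names are hypothetical, and `trivially_sanitize` is only one possible reading of "separating quasi-identifiers from sensitive attributes": it publishes the two column sets as unlinked tables.

```python
from collections import Counter, defaultdict
from math import log2


def group_by_quasi_identifier(records, qi_fields):
    """Group records by their quasi-identifier tuple."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in qi_fields)].append(rec)
    return groups


def is_k_anonymous(records, qi_fields, k):
    """True if every quasi-identifier tuple appears in at least k records."""
    groups = group_by_quasi_identifier(records, qi_fields)
    return all(len(group) >= k for group in groups.values())


def min_sensitive_entropy(records, qi_fields, sensitive_field):
    """Smallest Shannon entropy (in bits) of the sensitive attribute
    over all quasi-identifier groups; higher means more diverse."""
    def entropy(group):
        counts = Counter(rec[sensitive_field] for rec in group)
        n = len(group)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    groups = group_by_quasi_identifier(records, qi_fields)
    return min(entropy(group) for group in groups.values())


def trivially_sanitize(records, qi_fields, sensitive_field):
    """Assumed reading of trivial sanitization: release quasi-identifiers
    and sensitive values as two tables with no row-level link."""
    qi_table = [tuple(rec[f] for f in qi_fields) for rec in records]
    # Sorting discards the original row order, so rows cannot be re-joined.
    sensitive_table = sorted(rec[sensitive_field] for rec in records)
    return qi_table, sensitive_table


# Hypothetical, already-generalized toy records (not from the paper).
RECORDS = [
    {"zip": "787**", "birth": "198*", "diagnosis": "flu"},
    {"zip": "787**", "birth": "198*", "diagnosis": "asthma"},
    {"zip": "787**", "birth": "197*", "diagnosis": "flu"},
    {"zip": "787**", "birth": "197*", "diagnosis": "flu"},
]
QI_FIELDS = ["zip", "birth"]
```

On this toy table, `is_k_anonymous(RECORDS, QI_FIELDS, 2)` holds because each quasi-identifier tuple covers two records, yet one group contains only a single diagnosis, so its entropy is zero. This is the kind of gap between syntactic anonymity and actual privacy that the abstract points to.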
