ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008)

The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing


Abstract

Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each quasi-identifier tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier.

In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization, which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as the accuracy of data-mining algorithms executed on the same sanitized records.

For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
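The two syntactic conditions named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the toy records, and the attribute names (`zip`, `birth_year`, `disease`) are hypothetical, and the entropy check assumes the common formulation in which every quasi-identifier group must have sensitive-attribute entropy of at least log(l).

```python
from collections import Counter, defaultdict
from math import log


def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier tuple occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())


def min_group_entropy(records, quasi_ids, sensitive):
    """Smallest Shannon entropy (in nats) of the sensitive attribute
    over all groups of records sharing a quasi-identifier tuple."""
    groups = defaultdict(Counter)
    for r in records:
        groups[tuple(r[a] for a in quasi_ids)][r[sensitive]] += 1
    entropies = []
    for counter in groups.values():
        n = sum(counter.values())
        entropies.append(-sum((c / n) * log(c / n) for c in counter.values()))
    return min(entropies)


def is_entropy_l_diverse(records, quasi_ids, sensitive, l):
    """Assumed formulation: every group has entropy >= log(l)."""
    # Small tolerance guards against floating-point rounding at the boundary.
    return min_group_entropy(records, quasi_ids, sensitive) >= log(l) - 1e-12


# Hypothetical, already-generalized toy table (not from the paper's datasets).
records = [
    {"zip": "787**", "birth_year": "196*", "disease": "flu"},
    {"zip": "787**", "birth_year": "196*", "disease": "cancer"},
    {"zip": "787**", "birth_year": "197*", "disease": "flu"},
    {"zip": "787**", "birth_year": "197*", "disease": "flu"},
]
quasi_ids = ["zip", "birth_year"]

two_anonymous = is_k_anonymous(records, quasi_ids, 2)
three_anonymous = is_k_anonymous(records, quasi_ids, 3)
two_diverse = is_entropy_l_diverse(records, quasi_ids, "disease", 2)
```

In this toy table each quasi-identifier group has two records, so the table is 2-anonymous but not 3-anonymous. The second group contains only one sensitive value, so its entropy is zero and the table fails the entropy condition for l = 2: an adversary who can place a person in that group learns the sensitive value despite 2-anonymity.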
