
Achieving anonymity via clustering

Abstract

Publishing data for analysis from a table of personal records, while preserving individual privacy, is a problem of increasing importance. The traditional approach to de-identifying records is to remove identifying fields such as social security number and name. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code [15]. Sweeney [16] proposed the k-anonymity model for privacy, in which non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records with exactly the same quasi-identifier values. We propose a new method for anonymizing data records, in which the quasi-identifiers of the records are first clustered and the cluster centers are then published. To ensure the privacy of the records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of records. This technique is more general than k-anonymity because it allows a much larger choice of cluster centers. In many cases, it lets us release considerably more information without compromising privacy. We also provide constant-factor approximation algorithms for computing such a clustering. This is the first set of algorithms for the anonymization problem whose performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication. Thus, by withholding a small fraction of the database records, we can ensure that the data published for analysis has less distortion and is therefore more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.
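
To make the idea of publishing cluster centers under a minimum cluster size more concrete, here is a minimal Python sketch. It is not the constant-factor approximation algorithm from the paper; it is a simple greedy heuristic chosen only for illustration. The names `greedy_r_clusters`, `centroid`, and `publish`, the use of Euclidean distance on numeric quasi-identifiers, and the choice to publish each cluster's size next to its center are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point], members: Sequence[int]) -> Point:
    """Coordinate-wise mean of the points whose indices are in `members`."""
    dims = len(points[members[0]])
    return tuple(
        sum(points[i][d] for i in members) / len(members) for d in range(dims)
    )


def greedy_r_clusters(points: Sequence[Point], r: int) -> List[List[int]]:
    """Illustrative heuristic (hypothetical, not the paper's algorithm).

    Repeatedly pick an unassigned point and group it with its r-1 nearest
    unassigned neighbours. Fewer than r leftover points are attached to the
    cluster with the nearest centroid, so every cluster has at least r members.
    """
    if r < 1 or len(points) < r:
        raise ValueError("need r >= 1 and at least r points")
    unassigned = list(range(len(points)))
    clusters: List[List[int]] = []
    while len(unassigned) >= r:
        seed = unassigned[0]
        # Sort remaining points by distance to the seed (seed itself is first).
        unassigned.sort(key=lambda i: math.dist(points[seed], points[i]))
        clusters.append(unassigned[:r])
        unassigned = unassigned[r:]
    for i in unassigned:
        nearest = min(
            clusters, key=lambda c: math.dist(points[i], centroid(points, c))
        )
        nearest.append(i)
    return clusters


def publish(points: Sequence[Point], r: int) -> List[Tuple[Point, int]]:
    """Return (cluster center, cluster size) pairs instead of raw records."""
    return [(centroid(points, c), len(c)) for c in greedy_r_clusters(points, r)]


if __name__ == "__main__":
    # Toy quasi-identifiers, already encoded as numbers (an assumption).
    records = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0), (20.0, 20.0)]
    for center, size in publish(records, r=2):
        print(center, size)
```

In the toy data, the isolated record at (20, 20) is absorbed into the nearest cluster and pulls its center away from the other members. This is the kind of distortion that the abstract's outlier extension targets by leaving an ε fraction of points unclustered.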
