Discovering Frequent Patterns in Sensitive Data

Abstract

Discovering frequent patterns from data is a popular exploratory technique in data mining. However, if the data are sensitive (e.g., patient health records, user behavior records), releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns along with their frequencies in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there.

We present two efficient algorithms for discovering the k most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning 'noisy' lists of patterns that are close to the actual list of k most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-k pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately 'robust' measure of interest.
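To make the idea of a 'noisy' top-k list concrete, the following is a minimal Python sketch. It is not either of the paper's two algorithms, which the abstract does not spell out. It uses a textbook approach instead: add Laplace noise to every count, then rank. It also handles only single items rather than general patterns. The function names, the `max_items` truncation, and the requirement that `universe` be a fixed, publicly known item list are all assumptions made for illustration. Under those assumptions, adding or removing one record changes the count vector by at most `max_items` in L1 norm, so Laplace noise of scale `max_items / epsilon` on each count gives epsilon-differential privacy for the noisy counts, and picking the top k is post-processing.

```python
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_k_items(transactions, universe, k, epsilon, max_items, rng=None):
    """Return k (item, noisy_count) pairs ranked by noisy count.

    Hypothetical helper for illustration only. Assumes `universe` is a
    fixed, data-independent list of items.
    """
    if rng is None:
        rng = random.Random()
    universe = list(universe)
    allowed = set(universe)

    counts = Counter()
    for record in transactions:
        # Deduplicate, drop unknown items, and truncate deterministically so
        # that one record contributes to at most `max_items` counts.
        items = sorted(set(record) & allowed)[:max_items]
        counts.update(items)

    # L1 sensitivity of the count vector is max_items (add/remove one record).
    scale = max_items / epsilon
    noisy = {item: counts[item] + laplace(scale, rng) for item in universe}

    ranked = sorted(universe, key=lambda item: noisy[item], reverse=True)[:k]
    return [(item, noisy[item]) for item in ranked]


if __name__ == "__main__":
    data = [["a", "b"], ["a", "c"], ["a", "b"], ["a"], ["b"]]
    print(noisy_top_k_items(data, "abcd", k=2, epsilon=1.0, max_items=2))
```

With a small `epsilon` the reported list can differ from the true top k, which is the kind of false positive and false negative the abstract's utility criterion is meant to bound. The sketch also samples noise in ordinary floating point, so it should not be treated as a hardened implementation.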
