IEEE International Conference on Data Mining

Iterative Classification for Sanitizing Large-Scale Datasets



Abstract

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text) and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations, and ensures a low utility for a resource-limited adversary. Moreover, using several real world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
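The abstract describes the publisher side as an iterative greedy loop: train a classifier, withhold every instance it still flags as sensitive, fold the withheld instances back into the training pool, and repeat until nothing further is flagged (or an iteration cap is reached). The sketch below is a minimal illustration of that loop under stated assumptions; the classifier choice (TF-IDF plus logistic regression), the helper name `sanitize`, and the data layout are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the iterative greedy sanitization loop described in the abstract.
# All names and the classifier choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def sanitize(labeled_texts, labeled_flags, candidate_texts, max_iterations=5):
    """labeled_texts / labeled_flags: seed training data, flag 1 = sensitive, 0 = non-sensitive.
    candidate_texts: instances the publisher would like to release.
    Returns the subset of candidates that the final classifier predicts to be non-sensitive."""
    published = list(candidate_texts)
    for _ in range(max_iterations):
        if not published:
            break
        # The publisher simulates the learning-based attacker on the current release set.
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(labeled_texts, labeled_flags)

        preds = clf.predict(published)
        flagged = [t for t, p in zip(published, preds) if p == 1]
        if not flagged:
            break  # fixed point: the classifier finds no remaining sensitive instances
        # Greedy step: withhold everything flagged, and let the next classifier
        # in the cascade learn from the withheld instances as sensitive examples.
        published = [t for t, p in zip(published, preds) if p == 0]
        labeled_texts = list(labeled_texts) + flagged
        labeled_flags = list(labeled_flags) + [1] * len(flagged)
    return published
```

In this sketch each iteration can only shrink the release set, which is why the loop terminates after a bounded number of passes; the paper's result that at most a linear number of iterations (and in practice no more than 5) are needed refers to the authors' algorithm, not to this simplified illustration.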


