Scalable Iterative Classification for Sanitizing Large-Scale Datasets

Abstract

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), so machine learning approaches have been developed to detect and remove identifiers from it. Learning is never perfect, so relying on such approaches to sanitize data can leak sensitive information, but a small risk is often acceptable. Our goal is to balance the value of published data against the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between (1) a publisher who chooses a set of classifiers to apply to the data and publishes only instances predicted as non-sensitive and (2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets, we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
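The abstract does not spell out the algorithm, so the following is only a rough sketch of what an iterative greedy sanitization loop could look like, not the authors' method. It assumes the publisher holds labeled instances, retrains a classifier each round on whatever is still slated for publication, and withholds everything that classifier flags, stopping when nothing more is flagged or a round limit is reached. The names `train_token_classifier`, `iterative_sanitize`, `min_ratio`, and `max_iters` are hypothetical, and the token-ratio learner is a toy stand-in for a real text classifier.

```python
from collections import Counter


def train_token_classifier(texts, labels, min_ratio=0.5):
    """Toy learner (an assumption, not from the paper): flag any token that
    occurs in sensitive instances more than `min_ratio` of the time."""
    sensitive_counts, total_counts = Counter(), Counter()
    for text, is_sensitive in zip(texts, labels):
        for token in set(text.lower().split()):
            total_counts[token] += 1
            if is_sensitive:
                sensitive_counts[token] += 1
    flagged = {
        token
        for token in total_counts
        if sensitive_counts[token] / total_counts[token] > min_ratio
    }
    # The returned predictor marks an instance sensitive if it contains
    # at least one flagged token.
    return lambda text: any(tok in flagged for tok in text.lower().split())


def iterative_sanitize(texts, labels, max_iters=5):
    """Hypothetical greedy loop: each round, retrain on the instances still
    slated for publication and withhold everything predicted sensitive."""
    remaining = list(zip(texts, labels))
    rounds = 0
    for _ in range(max_iters):
        if not any(label for _, label in remaining):
            break  # no sensitive examples left to learn from
        cur_texts, cur_labels = zip(*remaining)
        predict = train_token_classifier(cur_texts, cur_labels)
        kept = [(t, l) for t, l in remaining if not predict(t)]
        if len(kept) == len(remaining):
            break  # the classifier found nothing more to remove
        remaining = kept
        rounds += 1
    return [t for t, _ in remaining], rounds


# Made-up example data; True marks an instance containing an identifier.
texts = [
    "patient john smith admitted",
    "blood pressure normal",
    "call mary jones tomorrow",
    "dosage increased to 5mg",
    "john reports mild pain",
]
labels = [True, False, True, False, True]

published, rounds = iterative_sanitize(texts, labels)
print(published, rounds)
```

Retraining on the remaining data each round is meant to mimic an attacker who learns from what was actually published; a real implementation would also have to weigh the value of each withheld instance, which this sketch ignores.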