Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text), and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect, however, and relying on them to sanitize data can therefore leak sensitive information. Because a small amount of risk is permissible in practice, our goal is to balance the value of the published data against the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between (1) a publisher, who chooses a set of classifiers to apply to the data and publishes only the instances predicted to be non-sensitive, and (2) an attacker, who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably terminates after at most a linear number of iterations and ensures a low utility for a resource-limited adversary. Moreover, on several real-world natural language corpora, we show that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most five iterations.
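The iterative greedy publisher strategy described above can be sketched in a few lines: repeatedly train a classifier on the remaining candidate release, withhold everything it flags as sensitive, and stop once the newest classifier flags nothing. This is only an illustrative toy, not the paper's algorithm: the data and the one-dimensional threshold "classifier" below are stand-ins for the learned entity recognizers the abstract refers to.

```python
# Toy sketch of an iterative greedy sanitization loop.
# Instances are (feature, is_sensitive) pairs; the "classifier" is a
# hypothetical 1-D threshold rule standing in for a learned model.

def train_threshold_classifier(instances):
    """Fit a threshold t minimizing training error, where the rule
    predicts 'sensitive' whenever feature > t."""
    best_t, best_err = None, float("inf")
    for t in sorted({x for x, _ in instances}):
        err = sum((x > t) != sensitive for x, sensitive in instances)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def greedy_sanitize(instances, max_iters=10):
    """Greedily build a set of classifiers: each round, train on the
    current candidate release and withhold every flagged instance.
    Each round removes at least one instance, so the loop runs at most
    a linear number of iterations."""
    published = list(instances)
    thresholds = []
    for _ in range(max_iters):
        t = train_threshold_classifier(published)
        flagged = [(x, s) for x, s in published if x > t]
        if not flagged:
            break  # newest classifier finds nothing sensitive; stop
        thresholds.append(t)
        published = [(x, s) for x, s in published if x <= t]
    return published, thresholds
```

In this sketch, the published set is exactly the instances that pass every classifier trained so far; the trade-off the abstract describes shows up as non-sensitive instances that get withheld alongside the sensitive ones.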