Scalable Iterative Classification for Sanitizing Large-Scale Datasets

Bo Li; Yevgeniy Vorobeychik; Muqun Li; Bradley Malin

Abstract

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93 percent of the original data, and completes after at most five iterations.
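
The publisher's side of the game described above (apply a classifier, withhold everything it flags, retrain on what is left, and repeat) can be pictured with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' algorithm: the one-token "stump" classifier, the helper names `train_stump` and `iterative_sanitize`, the sample documents, and the stopping rule are all hypothetical illustrations. The default `max_rounds=5` merely echoes the "at most five iterations" reported in the abstract.

```python
from collections import Counter


def train_stump(instances, labels):
    """Toy classifier (an assumption, not the paper's learner).

    Returns the single token that occurs in the most sensitive instances
    relative to non-sensitive ones, or None if no token favors the
    sensitive class.
    """
    pos, neg = Counter(), Counter()
    for tokens, y in zip(instances, labels):
        (pos if y else neg).update(set(tokens))
    best, best_score = None, 0
    for tok in sorted(pos):  # sorted for deterministic tie-breaking
        score = pos[tok] - neg[tok]
        if score > best_score:
            best, best_score = tok, score
    return best


def iterative_sanitize(instances, labels, max_rounds=5):
    """Greedily add classifiers until none can flag the remaining data.

    Each round trains a stump on the instances that are still publishable
    and withholds every instance the stump predicts as sensitive.
    Returns the list of learned tokens and the indices left to publish.
    """
    remaining = list(range(len(instances)))
    classifiers = []
    for _ in range(max_rounds):
        tok = train_stump(
            [instances[i] for i in remaining],
            [labels[i] for i in remaining],
        )
        if tok is None:  # nothing left that a learner could pick up
            break
        classifiers.append(tok)
        remaining = [i for i in remaining if tok not in instances[i]]
    return classifiers, remaining


# Hypothetical sample data: label 1 marks an instance containing an identifier.
docs = [
    ["patient", "john", "smith"],
    ["dr", "john", "visited"],
    ["phone", "555", "1234"],
    ["fever", "and", "cough"],
    ["prescribed", "rest"],
    ["call", "phone", "later"],
]
labels = [1, 1, 1, 0, 0, 0]

classifiers, published = iterative_sanitize(docs, labels)
print(classifiers, published)
```

In this sketch the loop stops as soon as a freshly trained classifier finds nothing to flag among the remaining instances, which loosely mirrors the idea of leaving no automatically identifiable sensitive instances for an attacker who uses the same kind of learner. The trade-off with published data value shows up as the number of indices that survive in `published`.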

Bibliographic details

Source
Theoretical and Experimental Plant Physiology | 2017, Issue 3 | 14 pages
Authors
Bo Li; Yevgeniy Vorobeychik; Muqun Li; Bradley Malin
Affiliations

Vanderbilt University, Nashville, TN (listed for each of the four authors)

Indexing information
Original format: PDF
Language of text: English
CLC classification: Plant physiology
Keywords
game theory; privacy preserving; weak structured data sanitization

Similar literature

1. Scalable Iterative Classification for Sanitizing Large-Scale Datasets [J]. Bo Li, Yevgeniy Vorobeychik, Muqun Li. IEEE Transactions on Knowledge and Data Engineering, 2017, Issue 3.
2. Latent-lSVM classification of very high-dimensional and large-scale multi-class datasets [J]. Thanh-Nghi Do, François Poulet. Concurrency and Computation: Practice and Experience, 2019, Issue 2.
3. A fast classification strategy for SVM on the large-scale high-dimensional datasets [J]. Li I-Jing, Wu Jiunn-Lin, Yeh Chih-Hung. Pattern Analysis and Applications, 2018, Issue 4.
4. Iterative Classification for Sanitizing Large-Scale Datasets [C]. Bo Li, Yevgeniy Vorobeychik, Muqun Li. IEEE International Conference on Data Mining, 2015.
5. Analysis of Large-Scale Human Genetic Datasets to Identify Novel Risk Factors and Therapeutic Targets for Cardiometabolic Disease [D]. Emdin, Connor. 2020.
6. Scalable Iterative Classification for Sanitizing Large-Scale Datasets [O]. Bo Li, Yevgeniy Vorobeychik, Muqun Li.
7. Are open set classification methods effective on large-scale datasets? [O]. Ryne Roady, Tyler L. Hayes, Ronald Kemker. 2020.
