Semi-Supervised Approach to Rapid and Reliable Labeling of Large Data Sets

Abstract

In this paper, we propose a method in which a data set is labeled in a semi-supervised manner with user-specified guarantees on the quality of the labeling. In our scheme, we assume that for each class we have some heuristics available, each of which can identify instances of one particular class. The heuristics are assumed to perform reasonably well, but they need not cover all instances of the class, nor do they need to be perfectly reliable. We further assume that we have an infallible expert who is willing to manually label a few instances. The aim of the algorithm is to exploit the cluster structure of the problem, the predictions of the imperfect heuristics, and the limited perfect labels provided by the expert to classify (label) the instances of the data set with guaranteed precision (specified by the user) with respect to each class. The specified precision is not always attainable, so the algorithm is allowed to classify some instances as dontknow. The algorithm is evaluated by the number of instances labeled by the expert, the number of dontknow instances (global coverage), and the achieved quality of the labeling. On the KDD Cup Network Intrusion data set containing 500,000 instances, we managed to label 96.6% of the instances while guaranteeing a nominal precision of 90% (with 95% confidence) by having the expert label 630 instances; by having the expert label 1200 instances, we managed to guarantee 95% nominal precision while labeling 96.4% of the data. We also provide a case study of applying our scheme to label the network traffic collected at a large campus network.
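The abstract does not spell out how the precision guarantee is computed, so the following is only a minimal sketch of the accept-or-abstain idea, not the authors' algorithm. It assumes that, for one cluster where a heuristic predicts a class, the expert has checked a random sample of instances, and it uses a one-sided Hoeffding bound (an assumption chosen here for simplicity) to decide whether to keep the heuristic's label or fall back to dontknow. The function names and parameters are hypothetical.

```python
import math


def precision_lower_bound(n_correct, n_checked, confidence=0.95):
    """One-sided Hoeffding lower bound on the true precision.

    Assumption: the expert-checked instances are a random sample of the
    instances the heuristic assigned to this class in this cluster.
    """
    if n_checked == 0:
        return 0.0  # nothing checked, so nothing can be guaranteed
    delta = 1.0 - confidence
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n_checked))
    return max(0.0, n_correct / n_checked - slack)


def decide_label(heuristic_class, n_correct, n_checked,
                 target_precision=0.90, confidence=0.95):
    """Keep the heuristic's label only if the bound meets the target."""
    bound = precision_lower_bound(n_correct, n_checked, confidence)
    return heuristic_class if bound >= target_precision else "dontknow"


if __name__ == "__main__":
    # Hypothetical counts: the expert confirms every checked instance.
    print(decide_label("attack", n_correct=200, n_checked=200))
    print(decide_label("attack", n_correct=50, n_checked=50))
```

With these hypothetical counts, 200 confirmed instances are enough to clear a 90% target at 95% confidence under this bound, while 50 are not, which illustrates why a stricter precision target requires more expert labels. A tighter interval (for example, an exact binomial bound) would need fewer labels than this conservative sketch.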

Bibliographic details

Source: ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008 | 2008 | pp. 623-631 | 9 pages
Authors: Gyoergy J. Simon; Vipin Kumar; Zhi-Li Zhang
Original format: PDF
Chinese Library Classification: Information and knowledge dissemination
Keywords: algorithms; theory; experimentation

