Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection

Guansong Pang; Longbing Cao; Ling Chen; Huan Liu

Abstract

Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands or millions of features, is a major way to help learning methods cope with the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving data regularity information and learn the representations independently of the subsequent outlier detection method, which can result in suboptimal and unstable performance when detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach: the random distance-based approach. This customized learning yields better optimized and more stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage a small amount of labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO as an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh-dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and a speedup of two orders of magnitude; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.
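
As a rough illustration of the random distance-based detection that the abstract refers to, the following Python sketch scores each point by its average nearest-neighbor distance to small random subsamples of the data. It is not REPEN and not the authors' code: the function name `random_distance_scores`, its parameters (`subsample_size`, `n_rounds`, `seed`), and the toy data are assumptions made for illustration only.

```python
import math
import random


def random_distance_scores(data, subsample_size=8, n_rounds=20, seed=0):
    """Average nearest-neighbor distance to random subsamples (hypothetical helper).

    A higher score suggests the point lies farther from the bulk of the data.
    """
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    # At least two references, so a point drawn into the subsample
    # still has another reference to be compared against.
    k = min(max(2, subsample_size), len(data))
    scores = [0.0] * len(data)
    for _ in range(n_rounds):
        # Draw a small random reference set (by index) from the data.
        ref_idx = rng.sample(range(len(data)), k)
        for i, point in enumerate(data):
            # Distance to the closest reference point other than the point itself.
            scores[i] += min(math.dist(point, data[j]) for j in ref_idx if j != i)
    return [s / n_rounds for s in scores]


# Toy data (assumed): 50 Gaussian inliers in 5 dimensions plus one distant point.
toy_rng = random.Random(42)
data = [[toy_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(50)]
data.append([10.0] * 5)

scores = random_distance_scores(data)
top = max(range(len(data)), key=scores.__getitem__)
print("highest-scoring index:", top)
```

In this toy setting the distant point receives the highest score. The sketch works directly on the raw features; the abstract's point is that such a detector benefits when it is applied to a learned low-dimensional representation instead.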

Bibliographic record

Source
SIGKDD Explorations | 2018, issue udisk | 10 pages
Authors
Guansong Pang; Longbing Cao; Ling Chen; Huan Liu
Author affiliations

University of Technology Sydney;

University of Technology Sydney;

University of Technology Sydney;

Arizona State University;

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: TP274.2
Keywords
Outlier Detection; Representation Learning; Ultrahigh-dimensional Data; Dimension Reduction;

Similar literature

1. Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection [J]. Guansong Pang, Longbing Cao, Ling Chen. SIGKDD Explorations. 2018, issue Udisk.
2. Distance-based outlier detection for high dimension, low sample size data [J]. Ahn Jeongyoun, Lee Myung Hee, Lee Jung Ae. Journal of Applied Statistics. 2019, issue 1a4.
3. Efficient distance-based outlier detection on uncertain datasets of Gaussian distribution [J]. Salman A. Shaikh, Hiroyuki Kitagawa. World Wide Web. 2014, issue 4.
4. A distance-based trajectory outlier detection method on maritime traffic data [C]. Bao Lei, Du Mingchao. International Conference on Control, Automation and Robotics. 2018.
5. Random Subspace Learning on Outlier Detection and Classification with Minimum Covariance Determinant Estimator [D]. Liu, Bohan. 2016.
6. Data mining application to healthcare fraud detection: a two-step unsupervised clustering method for outlier detection with administrative databases [O]. Michela Carlotta Massi, Francesca Ieva, Emanuele Lettieri. 2020.
7. Learning Representations of Ultrahigh-dimensional Data for Random Distance-based Outlier Detection [O]. Guansong Pang, Longbing Cao, Ling Chen. 2018.
