Scalable Clustering for Mining Local-Correlated Clusters in High Dimensions and Large Datasets

Kun-Che Lu; Don-Lin Yang

Abstract

Clustering is useful for mining the underlying structure of a dataset to support decision making, since target or high-risk groups can be identified. However, for high-dimensional datasets, the results of traditional clustering methods can be meaningless, because clusters may be apparent only with respect to a small subset of the features. Taking customer datasets as an example, some customers may be correlated through their salary and education, while others may be correlated through their job and house location. If one uses all the features of a customer for clustering, these local-correlated clusters may not be revealed. In addition, processing high-dimensional and large datasets is a challenging problem in decision making. Searching all combinations of every feature with every record to extract local-correlated clusters is infeasible, because the cost is exponential in the data dimensionality and cardinality. In this paper, we propose a scalable 2-Leveled Approximated Hyper-image-based Clustering framework, referred to as 2L-HIC-A, for mining local-correlated clusters, where the clustering process at each level requires only one scan of the original dataset. Moreover, the data-processing time of 2L-HIC-A can be independent of the input data size. In 2L-HIC-A, various well-developed image processing techniques can be exploited for mining clusters. Instead of proposing a new clustering algorithm, our framework can accommodate other clustering methods for mining local-correlated clusters and sheds new light on existing clustering techniques.
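
The abstract does not give the details of 2L-HIC-A, so the following is only a minimal sketch of the general idea it describes, not the authors' algorithm. It assumes a two-feature subspace, bins the records into a small grid (a stand-in for the "hyper-image") in a single scan, and then finds clusters on the grid with a standard image-processing step, connected-component labeling. The names `build_image`, `label_dense_regions`, `bins`, and `min_count` are hypothetical and chosen for illustration.

```python
from collections import deque


def build_image(points, bins, lo, hi):
    """Single scan of the data: count the points that fall in each grid cell.

    Assumes every coordinate lies in [lo, hi] (an assumption of this sketch).
    """
    image = [[0] * bins for _ in range(bins)]
    width = (hi - lo) / bins
    for x, y in points:
        # Clamp so that a value equal to `hi` lands in the last cell.
        i = min(int((x - lo) / width), bins - 1)
        j = min(int((y - lo) / width), bins - 1)
        image[i][j] += 1
    return image


def label_dense_regions(image, min_count):
    """Label 4-connected regions of cells holding at least `min_count` points.

    Works on the grid only, so its cost depends on the number of cells,
    not on the number of input records.
    """
    bins = len(image)
    labels = [[0] * bins for _ in range(bins)]  # 0 means "not in any cluster"
    n_clusters = 0
    for i in range(bins):
        for j in range(bins):
            if image[i][j] < min_count or labels[i][j]:
                continue
            n_clusters += 1
            labels[i][j] = n_clusters
            queue = deque([(i, j)])
            while queue:  # breadth-first flood fill over dense neighbours
                a, b = queue.popleft()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    na, nb = a + da, b + db
                    if (0 <= na < bins and 0 <= nb < bins
                            and image[na][nb] >= min_count
                            and not labels[na][nb]):
                        labels[na][nb] = n_clusters
                        queue.append((na, nb))
    return labels, n_clusters


# Made-up toy data: two tight groups and one isolated record.
points = [(0.10, 0.10), (0.12, 0.11), (0.15, 0.13),
          (0.90, 0.90), (0.88, 0.91), (0.50, 0.50)]
image = build_image(points, bins=4, lo=0.0, hi=1.0)
labels, n_clusters = label_dense_regions(image, min_count=2)
print(n_clusters)  # prints 2
```

Only `build_image` touches the records, and it does so once. The labeling step sees just the grid, which is one way a processing time independent of the input size could arise once the grid resolution is fixed.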

Bibliographic details

Source: Fundamenta Informaticae, 2010, No. 1, pp. 15-32 (18 pages)
Authors: Kun-Che Lu; Don-Lin Yang
Affiliation: Department of Information Engineering and Computer Science, Feng Chia University, 100 Wen Hwa Road, Taichung, Taiwan, ROC
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords: local-correlated cluster; approximated clustering; high dimension; large dataset; image processing; morphology

Similar literature

1. SWIFT—Scalable Clustering for Automated Identification of Rare Cell Populations in Large, High-Dimensional Flow Cytometry Datasets, Part 2: Biological Evaluation [J]. Tim R. Mosmann, Iftekhar Naim, Jonathan Rebhahn. Cytometry, Part A: the journal of the International Society for Analytical Cytology, 2014, No. 5.
2. SWIFT—Scalable Clustering for Automated Identification of Rare Cell Populations in Large, High-Dimensional Flow Cytometry Datasets, Part 1: Algorithm Design [J]. Iftekhar Naim, Suprakash Datta, Jonathan Rebhahn. Cytometry, Part A: the journal of the International Society for Analytical Cytology, 2014, No. 5.
3. CLIC: clustering analysis of large microarray datasets with individual dimension-based clustering [J]. Gwan-Su Yi, Kihoon Cha, Taegyun Yun. Nucleic acids research, 2010, Suppl. 2.
4. DPM: Fast and scalable clustering algorithm for large scale high dimensional datasets [C]. Ghanem Tamer F., Elkilani Wail S., Ahmed Hatem S. International Computer Engineering Conference, 2014.
5. Visual data mining: Using parallel coordinate plots with K-means clustering and color to find correlations in a multidimensional dataset [D]. Peterson, Angela R. 2009.
6. SWIFT—Scalable Clustering for Automated Identification of Rare Cell Populations in Large High-Dimensional Flow Cytometry Datasets Part 2: Biological Evaluation [O]. Tim R Mosmann, Iftekhar Naim, Jonathan Rebhahn.
7. SWIFT—scalable clustering for automated identification of rare cell populations in large, high-dimensional flow cytometry datasets, Part 1: Algorithm design [O]. Iftekhar Naim, Suprakash Datta, Jonathan Rebhahn. 2014.
