Data Driven Similarity Measures for k-Means Like Clustering Algorithms

JACOB KOGAN; MARC TEBOULLE; CHARLES NICHOLAS

Abstract

We present an optimization approach that generates k-means like clustering algorithms. The batch k-means and the incremental k-means are two well-known versions of the classical k-means clustering algorithm (Duda et al. 2000). To benefit from the speed of the batch version and the accuracy of the incremental version, we combine the two in a "ping-pong" fashion. We use a distance-like function that combines the squared Euclidean distance with relative entropy. In the extreme cases, our algorithm recovers the classical k-means clustering algorithm and generalizes the Divisive Information Theoretic clustering algorithm recently reported independently by Berkhin and Becher (2002) and Dhillon et al. (2002). Results of numerical experiments that demonstrate the viability of our approach are reported.
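The abstract does not give the formula, so the following is only a minimal sketch of how such a scheme might look. It assumes the distance-like function is a weighted sum of half the squared Euclidean distance and a generalized relative entropy, with hypothetical weights `nu` and `mu`, and that data are non-negative vectors. All function names are illustrative and are not taken from the paper. Under this assumed form, the minimizing centroid of a cluster is its arithmetic mean, so `mu=0` behaves like classical k-means and `nu=0` leaves only the entropy term.

```python
import numpy as np

def distance_like(x, c, nu=1.0, mu=1.0, eps=1e-12):
    """Assumed form: nu/2 * ||x - c||^2 + mu * sum(x*log(x/c) - x + c)."""
    sq = 0.5 * nu * np.sum((x - c) ** 2, axis=-1)
    # eps guards log(0); a zero entry of x contributes only its c term
    kl = np.sum(x * np.log((x + eps) / (c + eps)) - x + c, axis=-1)
    return sq + mu * kl

def centroids_of(X, labels, k):
    # Arithmetic means; assumes every cluster is non-empty
    return np.stack([X[labels == j].mean(axis=0) for j in range(k)])

def objective(X, labels, k, nu, mu):
    C = centroids_of(X, labels, k)
    return float(np.sum(distance_like(X, C[labels], nu, mu)))

def batch_pass(X, labels, k, nu, mu):
    """One batch step: reassign every point to its nearest centroid."""
    C = centroids_of(X, labels, k)
    D = np.stack([distance_like(X, C[j], nu, mu) for j in range(k)], axis=1)
    new = D.argmin(axis=1)
    # Keep the old partition if the reassignment would empty a cluster
    return new if len(np.unique(new)) == k else labels

def incremental_step(X, labels, k, nu, mu, tol):
    """Best single-point move that lowers the objective, or None."""
    best, best_labels = objective(X, labels, k, nu, mu), None
    for i in range(len(X)):
        if np.sum(labels == labels[i]) == 1:
            continue  # do not empty a singleton cluster
        for j in range(k):
            if j == labels[i]:
                continue
            trial = labels.copy()
            trial[i] = j
            q = objective(X, trial, k, nu, mu)
            if q < best - tol:
                best, best_labels = q, trial
    return best_labels

def ping_pong(X, labels, k, nu=1.0, mu=1.0, tol=1e-9, max_iter=100):
    """Alternate batch passes (until stable) with single incremental moves."""
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        while True:
            new = batch_pass(X, labels, k, nu, mu)
            if objective(X, new, k, nu, mu) >= objective(X, labels, k, nu, mu) - tol:
                break
            labels = new
        moved = incremental_step(X, labels, k, nu, mu, tol)
        if moved is None:
            break  # neither batch nor incremental can improve
        labels = moved
    return labels, objective(X, labels, k, nu, mu)

# Small non-negative data set with two visible groups and a poor start
X = np.array([[1.0, 0.1], [1.1, 0.1], [0.9, 0.2],
              [0.1, 1.0], [0.2, 1.1], [0.1, 0.9]])
start = np.array([0, 1, 0, 1, 0, 1])
q_start = objective(X, start, 2, 1.0, 1.0)
final_labels, q_final = ping_pong(X, start, k=2)
```

In this sketch the cheap batch passes do most of the work, and the more expensive incremental step is tried only once the batch passes stall, which is one plausible reading of the "ping-pong" combination described above.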

Bibliographic details

Source: Information Retrieval, 2005, No. 2, pp. 331-349 (19 pages)
Authors: JACOB KOGAN; MARC TEBOULLE; CHARLES NICHOLAS
Author affiliation: Department of Mathematics and Statistics, UMBC, Baltimore, MD 21250
Original format: PDF
Language: English
Chinese Library Classification: Library science, librarianship
Keywords: clustering algorithms; optimization; entropy

Similar literature

1. An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm [J]. Maedeh Afzali, Suresh Kumar. International Journal of Information Technology and Computer Science, 2018, No. 9.
2. A Pitman measure of similarity in k-means for clustering heavy-tailed data [J]. Reybod Arman, Etminan Javad, Mohammadpour Adel. Communications in Statistics, 2019, No. 6a7.
3. Clustering of dissimilar perception phase constructed for similarity measures using k-means algorithm [C]. Bindiya M.K, RaviKumar G.K. International Conference on Applied and Theoretical Computing and Communication Technology, 2016.
4. Clustering educational digital library usage data: Comparisons of latent class analysis and K-means algorithms [D]. Xu, Beijie. 2011.
5. Balancing effort and benefit of K-means clustering algorithms in Big Data realms [O]. Joaquín Pérez-Ortega, Nelva Nely Almanza-Ortega, David Romero. 2012.