A Comparative Study of the Use of Coresets for Clustering Large Datasets

Abstract

A coreset is a compact subset of a data set such that models trained on the coreset provide a good fit to models trained on the full data set. By using coresets, we can scale a large data set down to a much smaller one and thereby reduce the computational cost of a machine learning problem. In recent years, data scientists have investigated various methods for constructing coresets. Two state-of-the-art algorithms proposed in 2018 are ProTraS by Ros & Guillaume and Lightweight Coreset by Bachem et al. In this paper, we briefly introduce these two algorithms and compare them to identify the benefits and drawbacks of each.
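The abstract does not give the details of either algorithm. As a rough illustration of what building a coreset can look like, the sketch below follows the sampling scheme commonly associated with the Lightweight Coreset construction for k-means: each point is drawn with a probability that is half uniform and half proportional to its squared distance from the data mean, and is weighted by the inverse of that probability. The function name `lightweight_coreset`, its parameters, and the synthetic data are assumptions made for illustration and are not taken from the paper.

```python
import random


def lightweight_coreset(points, m, seed=0):
    """Sample a weighted coreset of size m from a list of equal-length tuples.

    Hypothetical helper: a sketch of importance sampling around the data mean,
    not the authors' implementation.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])
    # Mean of the full data set.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    # Squared Euclidean distance of every point to the mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)
    if total > 0:
        # Half uniform mass, half proportional to squared distance.
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]
    else:
        # All points coincide: fall back to uniform sampling.
        q = [1.0 / n] * n
    # Draw m indices with replacement according to q.
    idx = rng.choices(range(n), weights=q, k=m)
    # Weight 1 / (m * q) makes the weighted cost an unbiased estimate.
    return [(points[i], 1.0 / (m * q[i])) for i in idx]


# Usage on synthetic 2-D data (assumed for illustration only).
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]
coreset = lightweight_coreset(data, m=50)
```

The resulting list of (point, weight) pairs could then be passed to any clustering routine that accepts per-sample weights. Because every sampling probability is at least 1/(2n), no weight can exceed 2n/m, which keeps the variance of the weighted estimate bounded.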

Bibliographic details

Source
International Conference on Future Data and Security Engineering | 2019 | pp. 45-55 | 11 pages
Authors
Nguyen Le Hoang; Tran Khanh Dang; Le Hong Trang
Original format: PDF
Keywords
Big data; Coresets; Clustering; k-means; k-median

