
Using category-based adherence to cluster market-basket data

Abstract

We devise an efficient algorithm for clustering market-basket data. Unlike traditional data, market-basket data are known to be high-dimensional and sparse and to contain a large number of outliers. Because they do not explicitly consider the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing only with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance both for marketing strategies and for presenting the results of clustering techniques for market-basket data. In view of these features of market-basket data, we devise a measurement, called the category-based adherence, and use it to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest large node in the taxonomy tree, where a large node is an item or a category node whose occurrence count exceeds a given threshold. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. With this category-based adherence measurement, we develop an efficient clustering algorithm for market-basket data, called algorithm CBA, with the objective of minimizing the category-based adherence. A validation model based on information gain is also devised to assess the quality of clustering for market-basket data. Our experimental results on both real and synthetic datasets show that, by using the taxonomy information, algorithm CBA significantly outperforms prior work in both execution efficiency and clustering quality for market-basket data.
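
The two definitions above (item distance and category-based adherence) can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several details are assumptions that the abstract does not specify: the toy taxonomy `PARENT`, the function names, counting a category once per transaction that contains any item beneath it, searching for the nearest large node only upward along the item's ancestors, and returning the number of links to the root plus one when no ancestor is large.

```python
from collections import Counter

# Hypothetical toy taxonomy, stored as child -> parent (the root maps to None).
PARENT = {
    "food": None,
    "drinks": "food",
    "snacks": "food",
    "cola": "drinks",
    "juice": "drinks",
    "chips": "snacks",
}


def ancestors_inclusive(node):
    """Yield the node itself, then its parent, and so on up to the root."""
    while node is not None:
        yield node
        node = PARENT[node]


def occurrence_counts(cluster):
    """Count, for every taxonomy node, the transactions in the cluster that touch it.

    Assumption: a category node is counted once per transaction that
    contains at least one item beneath it.
    """
    counts = Counter()
    for transaction in cluster:
        touched = set()
        for item in transaction:
            touched.update(ancestors_inclusive(item))
        counts.update(touched)
    return counts


def item_distance(item, counts, threshold):
    """Number of links from the item up to its nearest large node.

    A node is large when its occurrence count exceeds the threshold.
    Assumption: only the item and its ancestors are searched.
    """
    links = 0
    for links, node in enumerate(ancestors_inclusive(item)):
        if counts[node] > threshold:
            return links
    # Assumed penalty when no node on the path to the root is large.
    return links + 1


def adherence(transaction, cluster, threshold):
    """Average item distance of a non-empty transaction to the cluster."""
    if not transaction:
        raise ValueError("transaction must contain at least one item")
    counts = occurrence_counts(cluster)
    total = sum(item_distance(item, counts, threshold) for item in transaction)
    return total / len(transaction)


cluster = [{"cola", "chips"}, {"cola", "juice"}, {"juice"}]
print(adherence({"cola", "chips"}, cluster, threshold=1))
```

With a threshold of 1, `cola` is itself large (distance 0), while `chips` must climb two links to `food`, so the transaction's adherence to this cluster is the average of 0 and 2. A lower value means the transaction fits the cluster better, which is consistent with the stated objective of minimizing the adherence.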
