Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining

DISC: Data-Intensive Similarity Measure for Categorical Data


Abstract

The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression, and search are major data mining techniques that compute similarities between instances, so the choice of similarity measure can be a major cause of an algorithm's success or failure. The notion of similarity or distance is not as straightforward for categorical data as it is for continuous data, and is therefore a major challenge. This is because the values taken by a categorical attribute are not inherently ordered, so two categorical values cannot be compared directly. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from a domain expert in defining the similarity. It is also generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
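To make the difficulty with unordered categorical values concrete, the sketch below implements the simple overlap (matching) similarity, a common baseline for categorical data. It is not the DISC measure, whose definition is not given in the abstract. The function name `overlap_similarity` and the example records are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def overlap_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records agree.

    Illustrative baseline only; this is NOT the DISC measure.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        raise ValueError("records must have at least one attribute")
    # Categorical values have no order, so the only direct comparison
    # available is equality: a match counts 1, a mismatch counts 0.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical records with attributes (colour, body type, fuel).
red_sedan = ("red", "sedan", "petrol")
blue_sedan = ("blue", "sedan", "petrol")
green_truck = ("green", "truck", "diesel")

print(overlap_similarity(red_sedan, blue_sedan))
print(overlap_similarity(red_sedan, green_truck))
```

Under this baseline every mismatch is equally dissimilar: "red" versus "blue" contributes exactly as much as "red" versus "green". Measures that draw on the data itself, as the abstract describes for DISC, aim to avoid that limitation.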
