European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2011)

Comparing Apples and Oranges: Measuring Differences between Data Mining Results



Abstract

Deciding whether the results of two different mining algorithms provide significantly different information is an important open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or decide which mining approach will likely provide the most novel insight, it is essential that we can tell how different the information is that two results provide. In this paper we take a first step towards comparing exploratory results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and compare between these sets by Maximum Entropy modelling and Kullback-Leibler divergence. The measure we construct this way is flexible, and allows us to naturally include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and tell differences between results of different data mining methods. As an application, we show that it can also be used to identify which parts of results best redescribe other results. Experimental evaluation shows our measure gives meaningful results, correctly identifies methods that are similar in nature, and automatically provides sound redescriptions of results.
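The abstract notes that the proposed measure coincides with Jaccard dissimilarity when only exact tiles are considered. The sketch below illustrates only that special case, and it is not the authors' implementation: it assumes a tile can be represented as a pair (rows, columns), that an exact tile simply fixes every cell it covers, and that two results are compared through the sets of cells their tiles cover. The function names and the toy tiles are hypothetical.

```python
from itertools import product


def tile_cells(rows, cols):
    """Return the set of (row, column) cells covered by one tile."""
    return set(product(rows, cols))


def covered_cells(tiles):
    """Return the union of cells covered by a collection of tiles."""
    cells = set()
    for rows, cols in tiles:
        cells |= tile_cells(rows, cols)
    return cells


def jaccard_dissimilarity(tiles_a, tiles_b):
    """Jaccard dissimilarity between the cell sets of two tile collections.

    Assumption: each result is summarised by the cells its exact tiles
    cover. Two empty results are treated as identical (dissimilarity 0).
    """
    a = covered_cells(tiles_a)
    b = covered_cells(tiles_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy results: each tile is (row indices, column indices).
result_a = [((0, 1), (0, 1))]  # covers rows 0-1, columns 0-1
result_b = [((1, 2), (0, 1))]  # covers rows 1-2, columns 0-1

distance = jaccard_dissimilarity(result_a, result_b)
print(f"Jaccard dissimilarity: {distance:.3f}")
```

In this toy case the two results share the two cells in row 1 out of six covered cells in total, so the dissimilarity is 1 - 2/6. Noisy tiles, background knowledge, and the Maximum Entropy and Kullback-Leibler machinery described in the abstract are outside the scope of this sketch.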
