...
Journal: Data Mining and Knowledge Discovery

Diverse subgroup set discovery


Abstract

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
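
To make the redundancy problem concrete, the following is a minimal sketch of one way a cover-aware selection step could replace plain top-k selection inside a beam search. It is an illustration under stated assumptions, not the DSSD algorithm itself: the names `select_diverse`, `top_k` and the discount factor `alpha`, as well as the toy candidates, are hypothetical and do not come from the abstract. Each candidate is a (description, cover, quality) triple, and a candidate's quality is discounted according to how often its records are already covered by previously selected subgroups.

```python
from typing import Dict, FrozenSet, List, Tuple

# A candidate subgroup: (description, set of covered record ids, quality).
Subgroup = Tuple[str, FrozenSet[int], float]


def top_k(candidates: List[Subgroup], k: int) -> List[Subgroup]:
    """Plain top-k selection: rank by quality only (baseline)."""
    return sorted(candidates, key=lambda sg: sg[2], reverse=True)[:k]


def discounted_score(sg: Subgroup, counts: Dict[int, int], alpha: float) -> float:
    """Quality scaled by the average novelty of the subgroup's cover.

    A record already covered c times contributes alpha ** c, so a cover
    consisting only of unseen records keeps its full quality.
    """
    _, cover, quality = sg
    if not cover:
        return 0.0
    novelty = sum(alpha ** counts.get(r, 0) for r in cover) / len(cover)
    return quality * novelty


def select_diverse(candidates: List[Subgroup], k: int, alpha: float = 0.5) -> List[Subgroup]:
    """Greedy cover-aware selection (illustrative assumption, not DSSD itself)."""
    counts: Dict[int, int] = {}      # record id -> times covered so far
    selected: List[Subgroup] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda sg: discounted_score(sg, counts, alpha))
        selected.append(best)
        remaining.remove(best)
        for r in best[1]:            # update coverage counts
            counts[r] = counts.get(r, 0) + 1
    return selected


# Toy candidates: three near-identical variants on a numeric attribute
# and one lower-quality subgroup covering different records.
candidates: List[Subgroup] = [
    ("age > 40", frozenset({1, 2, 3, 4}), 0.90),
    ("age > 41", frozenset({1, 2, 3, 4}), 0.89),
    ("age > 42", frozenset({1, 2, 3}), 0.88),
    ("city = X", frozenset({7, 8, 9}), 0.60),
]

plain = [d for d, _, _ in top_k(candidates, 2)]
diverse = [d for d, _, _ in select_diverse(candidates, 2)]
print(plain)    # two variants of the same pattern
print(diverse)  # one variant plus the subgroup covering other records
```

In this toy setting, plain top-k fills the beam with two variants of the same pattern, whereas the cover-aware selection keeps the best variant and spends the second slot on a subgroup that covers different records. This is the exploration versus exploitation trade-off the abstract refers to.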
