Conference: 2019 12th IEEE Conference on Software Testing, Validation and Verification

SmokeOut: An Approach for Testing Clustering Implementations


Abstract

Clustering is a key Machine Learning technique, used in many high-stakes domains from medicine to self-driving cars. Many clustering algorithms have been proposed, and these algorithms have been implemented in many toolkits. Clustering users assume that clustering implementations are correct, reliable, and, for a given algorithm, interchangeable. We challenge these assumptions. We introduce SmokeOut, an approach and tool that pits clustering implementations against each other (and against themselves) while controlling for algorithm and dataset, to find datasets where clustering outcomes differ when they shouldn't, and to measure this difference. We ran SmokeOut on 7 clustering algorithms (3 deterministic and 4 nondeterministic) implemented in 7 widely-used toolkits, in a variety of scenarios on the Penn Machine Learning Benchmark (162 datasets). SmokeOut revealed that clustering implementations are fragile: on a given input dataset and with a given clustering algorithm, clustering outcomes and accuracy vary widely between (1) successive runs of the same toolkit; (2) different input parameters for that toolkit; (3) different toolkits.
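
The core idea of comparing clustering outcomes that should match can be illustrated with a small sketch. The abstract does not say which measure SmokeOut uses to quantify the difference, so the adjusted Rand index here is an assumption chosen for illustration, and the function names `adjusted_rand_index` and `min_pairwise_agreement` are hypothetical. Each labeling stands for the output of one run (or one toolkit) on the same dataset with the same algorithm.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same points (1.0 = same partition)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    n = len(labels_a)
    # Contingency counts: points in cluster i of A that are also in cluster j of B.
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    total = comb(n, 2)

    expected = sum_a * sum_b / total if total else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every point in one cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def min_pairwise_agreement(runs):
    """Lowest agreement over all pairs of runs; a value below 1.0 means outcomes differ."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    return min(
        adjusted_rand_index(runs[i], runs[j])
        for i in range(len(runs))
        for j in range(i + 1, len(runs))
    )


# Hypothetical labelings of four points from three runs.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same partition as run_1, cluster ids swapped
run_3 = [0, 1, 0, 1]  # a genuinely different partition
print(adjusted_rand_index(run_1, run_2))
print(min_pairwise_agreement([run_1, run_2, run_3]))
```

The measure ignores how clusters are numbered, so two runs that find the same groups under different ids count as agreeing. That property matters when comparing toolkits, which have no reason to assign matching cluster ids.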
