Conference: 2013 IEEE International Conference on Big Data

Elastic algorithms for guaranteeing quality monotonicity in big data mining

Abstract

When mining large data volumes in big data applications, users are typically willing to use algorithms that produce acceptable approximate results within given resource and time constraints. Two key challenges arise when designing such algorithms. The first is reasoning about the tradeoff between the quality of the data mining output (e.g., prediction accuracy for classification tasks) and the available resource and time budgets. The second is organizing the computation so that the algorithm is guaranteed to produce better-quality results as more budget is used. Little work has addressed these two challenges together in a generic way. In this paper, we propose a novel framework for developing elastic big data mining algorithms. Based on Shannon's entropy, an information-theoretic approach is introduced to reason about how result quality is affected by the allocated budget. This reasoning is then used to guide the development of algorithms that adapt to the available time budget while guaranteeing better-quality results as more budget is used. We demonstrate the framework by developing two example algorithms: elastic k-Nearest Neighbour (kNN) classification and elastic collaborative filtering (CF) recommendation. The core of both elastic algorithms is to apply a naïve kNN classification or CF algorithm over R-tree data structures that successively approximate the entire dataset. Experimental evaluation was performed on real datasets using prediction accuracy as the quality metric. The results show that, in practice, the elastic mining algorithms do produce results whose observable quality (i.e., prediction accuracy) increases consistently.
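The idea of running a naïve kNN classifier over successively finer approximations of a dataset can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the R-tree construction, the entropy-based quality model, or the form of the budget, so everything below is an assumption made for illustration. A median split on the widest dimension stands in for R-tree node splitting, the budget is a hypothetical cap on the number of group representatives compared against the query, and Shannon entropy is used only to report the label uncertainty inside each group. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def summarise(group):
    """Reduce a group of (point, label) pairs to (centroid, majority label, entropy)."""
    points = [p for p, _ in group]
    labels = [lab for _, lab in group]
    centroid = tuple(sum(axis) / len(points) for axis in zip(*points))
    majority = Counter(labels).most_common(1)[0][0]
    return centroid, majority, shannon_entropy(labels)


def split(group):
    """Median split on the widest dimension (assumed stand-in for an R-tree split)."""
    if len(group) < 2:
        return [group]
    dims = len(group[0][0])
    widest = max(
        range(dims),
        key=lambda d: max(p[d] for p, _ in group) - min(p[d] for p, _ in group),
    )
    ordered = sorted(group, key=lambda pl: pl[0][widest])
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]


def build_levels(points, labels, max_depth):
    """Return approximations of the dataset, coarsest first.

    Level d contains at most 2**d group summaries.
    """
    groups = [list(zip(points, labels))]
    levels = []
    for _ in range(max_depth + 1):
        levels.append([summarise(g) for g in groups])
        groups = [part for g in groups for part in split(g)]
    return levels


def knn_classify(reps, query, k):
    """Naive kNN: majority vote among the k representatives nearest to the query."""
    nearest = sorted(reps, key=lambda rep: math.dist(rep[0], query))[:k]
    return Counter(label for _, label, _ in nearest).most_common(1)[0][0]


def elastic_classify(levels, query, k, budget):
    """Classify using the finest level whose size fits the comparison budget."""
    usable = [level for level in levels if len(level) <= budget]
    reps = usable[-1] if usable else levels[0]
    return knn_classify(reps, query, k)


# Toy data: five points of class "a" near the origin, three of class "b" near (10, 10).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (9, 9), (10, 10), (10, 9)]
labels = ["a", "a", "a", "a", "a", "b", "b", "b"]
levels = build_levels(points, labels, max_depth=3)

for budget in (1, 2, 4, 8):
    print(budget, elastic_classify(levels, (9, 9), k=1, budget=budget))
```

On this toy data, a budget of one comparison sees only the single whole-dataset summary, while larger budgets reach finer levels in which the two classes are separated. The sketch gives no guarantee that quality never decreases as the budget grows; that guarantee is what the paper's entropy-based framework is said to provide.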
