Journal: ACM Transactions on Knowledge Discovery from Data

Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality

Abstract

Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces defined by query selections over datasets. This is crucial for data analysts engaged in, e.g., interactive data subspace exploration, data subspace visualization, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly financially, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain or maintain, or (iii) infeasible, e.g., because of privacy issues. We contribute a novel, query-driven function estimation model of analyst-defined data subspace cardinality. The proposed model is highly accurate in prediction and accommodates the well-known selection queries: multi-dimensional range queries and distance-nearest-neighbor (radius) queries. Our function estimation model (i) quantizes the vectorial query space by learning the analysts' access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model is that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
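
To make steps (i) to (iii) of the abstract more concrete, the following is a minimal sketch of a query-driven estimator. It is not the authors' model: the abstract gives no algorithmic detail, so the online prototype update, the inverse-distance weighting, the class name `QueryDrivenEstimator`, and all parameters are assumptions chosen for illustration. The change-detection step (iv), based on optimal stopping, is omitted entirely. A radius query is assumed to be encoded as the vector `[center, radius]`.

```python
import math
import random


class QueryDrivenEstimator:
    """Toy query-driven cardinality estimator (illustrative only, not the paper's model)."""

    def __init__(self, num_prototypes, learning_rate=0.1):
        self.k = num_prototypes
        self.lr = learning_rate
        self.prototypes = []     # prototypes quantizing the query space
        self.cardinalities = []  # running cardinality estimate per prototype

    @staticmethod
    def _dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def _nearest(self, query):
        return min(range(len(self.prototypes)),
                   key=lambda i: self._dist(query, self.prototypes[i]))

    def observe(self, query, cardinality):
        """Learn from one executed query and its observed cardinality."""
        if len(self.prototypes) < self.k:
            # Assumption: seed prototypes with the first k observed queries.
            self.prototypes.append(list(query))
            self.cardinalities.append(float(cardinality))
            return
        i = self._nearest(query)
        p = self.prototypes[i]
        # Step (i): move the winning prototype toward the query.
        self.prototypes[i] = [pj + self.lr * (qj - pj) for pj, qj in zip(p, query)]
        # Step (ii): move its associated cardinality toward the observed value.
        self.cardinalities[i] += self.lr * (cardinality - self.cardinalities[i])

    def predict(self, query, eps=1e-9):
        """Step (iii): similarity-weighted (inverse-distance) average over prototypes."""
        if not self.prototypes:
            raise ValueError("no queries observed yet")
        weights = [1.0 / (self._dist(query, p) + eps) for p in self.prototypes]
        total = sum(weights)
        return sum(w * c for w, c in zip(weights, self.cardinalities)) / total


def true_cardinality(data, center, radius):
    """Count 1-D points within `radius` of `center` (ground truth for the demo)."""
    return sum(1 for x in data if abs(x - center) <= radius)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.random() for _ in range(5000)]  # synthetic 1-D dataset
    model = QueryDrivenEstimator(num_prototypes=50)
    for _ in range(3000):  # training: past queries with their true answers
        c, r = rng.random(), rng.uniform(0.01, 0.2)
        model.observe([c, r], true_cardinality(data, c, r))
    c, r = 0.5, 0.1  # unseen query: answered from past queries, without touching the data
    print("predicted:", round(model.predict([c, r])), "actual:", true_cardinality(data, c, r))
```

After training, `predict` answers an unseen query from previously observed query/cardinality pairs alone, which is the property the abstract highlights for settings where data access is costly or infeasible. The accuracy of this toy version depends heavily on the assumed number of prototypes and learning rate.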
