首页> 外文期刊>IEEE Transactions on Knowledge and Data Engineering >Beyond independence: probabilistic models for query approximation on binary transaction data
【24h】

Beyond independence: probabilistic models for query approximation on binary transaction data

机译:超越独立性:二进制交易数据查询近似的概率模型

获取原文
获取原文并翻译 | 示例

摘要

We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high-dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade offs between approximation error, model complexity, and the online time required to compute a query answer.
机译:我们研究了生成大型稀疏二进制数据集的查询的快速近似答案的问题。我们特别关注基于概率模型的方法来解决此问题,并开发出许多比基准独立性模型更准确的技术。特别是,我们介绍了两种从频繁项目集构建概率模型的技术:项目集最大熵模型和项目集包含与排除模型。在最大熵模型中,我们将项目集视为对查询变量分布的约束,并使用最大熵原理为在线查询属性建立联合概率模型。在包含/排除模型中,项目集及其频率存储在称为ADtree的数据结构中,该数据结构支持对包含/排除原理的有效实施,以便回答查询。我们将这两个基于项目集的模型进行经验比较,以直接查询原始数据,查询原始数据的样本以及其他概率模型,例如独立性模型,Chow-Liu树模型和Bernoulli混合模型。这些模型能够处理高维度(数百或数千个属性),而与此主题相关的大多数其他工作都集中在相对低维度的OLAP问题上。在模拟交易数据集和实际交易数据集上的实验结果都说明了近似误差,模型复杂性和计算查询答案所需的在线时间之间的各种基本折衷。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号