D-Search: an efficient and exact search algorithm for large distribution sets

Abstract

Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions) to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; each customer is then represented as a distribution of 2-d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation or anomaly/fraud detection. We address this problem and present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations, and (2) multistep sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution runs up to 2,300 times faster in wall-clock time than the naive implementation without sacrificing accuracy.
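
The abstract compares D-Search against a naive implementation without describing either in detail. As a rough illustration only, the sketch below shows what such a naive baseline might look like: each distribution is assumed to be discretized into a histogram over the same buckets, similarity is measured with the exact symmetric KL divergence, and the query is compared against every distribution in turn. The function names (`symmetric_kl`, `naive_nearest`), the histogram representation, and the `eps` floor are assumptions made for this example; the paper's approximate KL divergence and multistep sequential scan are not reproduced here.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two histograms.

    p and q are sequences of bucket probabilities over the same buckets.
    eps is a hypothetical floor that avoids log(0) for empty buckets.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of buckets")
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)
        qi = max(qi, eps)
        # (pi - qi) * log(pi / qi) is the per-bucket sum of both KL directions.
        total += (pi - qi) * math.log(pi / qi)
    return total


def naive_nearest(query, dataset):
    """Full sequential scan: compute the divergence to every distribution."""
    best_idx, best_dist = -1, math.inf
    for idx, dist in enumerate(dataset):
        d = symmetric_kl(query, dist)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist
```

Calling `naive_nearest(query, dataset)` returns the index of the closest distribution and its divergence. Every query costs one full divergence computation per stored distribution, which is the kind of cost that pruning-based approaches aim to avoid.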
