首页> 外国专利> rapid detection of interesting patterns in a database by sequential sampling

rapid detection of interesting patterns in a database by sequential sampling

机译:通过顺序采样快速检测数据库中有趣的模式

摘要

Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility be the average (over the examples) of some function-which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
机译:许多发现问题(例如子组或关联规则发现)自然可以转换为n个最佳假设问题,其中目标是从给定的假设空间中找到根据某个效用函数得分最高的n个假设。我们提出了一种采样算法,通过发出少量数据库查询来解决此问题,同时保证精确度和解决方案质量上的界限。假设效用是某个函数的平均值(在示例中),已知的抽样方法已经处理了单个假设选择问题,而对于许多常用的效用函数而言,情况并非如此。我们证明了我们的算法适用于所有可以估计有限误差的效用。我们为一些最常用的实用程序提供了这些误差范围以及由此产生的最坏情况的样本范围,并证明对于无法使用有界误差进行估计的通用类实用程序函数没有采样算法。从几个示例开始,该算法开始返回(或丢弃)似乎已经特别好(或不好)的假设,因此该算法是顺序的。因此,该算法几乎总是比其最坏情况下的边界更快。

著录项

相似文献

  • 专利
  • 外文文献
  • 中文文献
获取专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号