【24h】

Bit-sliced index arithmetic

机译:位切片索引算法

获取原文

摘要

The bit-sliced index (BSI) was originally defined in [ONQ97]. The current paper introduces the concept of BSI arithmetic. For any two BSI's X and Y on a table T, we show how to efficiently generate new BSI's Z, V, and W, such that Z = X + Y, V = X - Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x - y and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min) (see [OO00, DB2SQL] for definitions of these clauses). Another contribution of the current paper is to generalize BSI range restrictions from [ONQ97] to a new non-Boolean form: to determine the top k BSI-valued rows, for ally meaningful value k between oneand the total number of rows in T. Together with bit-sliced addition, this permits us to solve a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection type column K representing keyword terms, we demonstrate an efficient algorithm to find k documents that share the largest number of terms with some query list Q of terms. A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm we introduce, which we call Bit-Sliced Term-Matching, or BSTM, uses an approach comparable in performance to the most efficient known IR algorithm, a major improvement on current DBMS text searching algorithms, with the advantage that it uses only indexing we propose for native database operations.

机译:

位切片索引( BSI )最初是在[ONQ97]中定义的。本文介绍了BSI算法的概念。对于表T上的任意两个BSI的X和Y,我们展示了如何有效地生成新的BSI的Z,V和W,从而Z = X + Y,V = X-Y和W = MIN(X,Y) ;这意味着,如果T中的行r具有以BSI X表示的值x和BSI Y的值y,则BSI Z中的r的值为x + y,V中的值为x-y且该值在W中将是MIN(x,y)。由于代表一组行的位图是最简单的位切片索引,因此BSI算法是确定由SQL子句UNION ALL(加法),EXCEPT ALL(减法)和SQL子句产生的多行(重复)的多集的最直接方法。 INTERSECT ALL(分钟)(有关这些子句的定义,请参见[OO00,DB2SQL])。当前论文的另一项贡献是将BSI范围限制从[ONQ97]推广到一种新的非布尔形式:确定前k个BSI值的行,以使有意义的值k在T和T中的总行数之间。通过按位加法,可以解决文本检索的一个常见基本问题:给定一个表示文档的行的对象关系表T,用一个集合类型列K表示关键字词,我们演示了一种有效的算法来查找k个文档与某些词条的查询列表Q共享最大数目的词条。在信息检索( IR )领域中,存在许多有关此类问题的已发表工作。我们介绍的算法(称为位切片术语匹配,即BSTM)使用性能与最有效的已知IR算法相当的方法,这是当前DBMS文本搜索算法的一项重大改进,其优点是仅使用索引我们建议对本机数据库进行操作。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号