Journal: Knowledge and Data Engineering, IEEE Transactions on

Similarity Group-by Operators for Multi-Dimensional Relational Data


Abstract

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat the attributes of multi-dimensional data independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and hence groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance of each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance of any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude improvement in performance over baseline methods developed to solve the same problem.
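To make the two grouping semantics concrete, here is a minimal Python sketch. It is not the paper's PostgreSQL implementation or its optimized algorithms. It is a naive, quadratic-time illustration under several assumptions: tuples are points given as coordinate tuples, the distance is Euclidean, the threshold is a parameter called `eps`, and the function names `sgb_any` and `sgb_all` as well as the `on_overlap` option values are hypothetical. The distance-to-all variant processes tuples greedily in input order, so its result can depend on that order.

```python
from math import dist  # Euclidean distance, Python 3.8+ (assumed metric)


def sgb_any(points, eps):
    """Distance-to-any grouping: a point joins a group if it is within eps
    of at least one member, i.e. connected components of the eps-graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs scan: merge the components of any two close points.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def sgb_all(points, eps, on_overlap="join_any"):
    """Distance-to-all (clique) grouping, greedy in input order.

    on_overlap (hypothetical option names) selects what happens when a
    point fits more than one existing group:
      "eliminate" drops the point, "join_any" puts it in the first
      matching group, "new_group" starts a new group for it.
    """
    if on_overlap not in ("eliminate", "join_any", "new_group"):
        raise ValueError(f"unknown on_overlap: {on_overlap!r}")

    groups = []
    for p in points:
        # Groups in which p is within eps of every current member.
        candidates = [g for g in groups if all(dist(p, q) <= eps for q in g)]
        if len(candidates) == 1:
            candidates[0].append(p)
        elif len(candidates) > 1 and on_overlap == "join_any":
            candidates[0].append(p)
        elif len(candidates) > 1 and on_overlap == "eliminate":
            continue
        else:
            # No matching group, or overlap handled by creating a new group.
            groups.append([p])
    return groups
```

For example, calling `sgb_any` and `sgb_all` on the points `(0, 0)`, `(1, 0)`, `(2, 0)` with `eps=1.5` shows the difference between the two operators: the distance-to-any version chains all three points into one group, while the distance-to-all version cannot place `(2, 0)` with `(0, 0)` because they are 2 apart.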
