Similarity Group-by Operators for Multi-Dimensional Relational Data

Tang Mingjie; Tahboub Ruby Y.; Aref Walid G.; Atallah Mikhail J.; Malluhi Qutaibah M.; Ouzzani Mourad; Silva Yasin N.


Abstract

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude improvement in performance over baseline methods developed to solve the same problem.
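
The two grouping semantics described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' PostgreSQL implementation or their optimized algorithms; it is a naive quadratic illustration under stated assumptions: tuples are plain coordinate tuples, the distance is Euclidean, tuples are processed in input order, and the function names `sgb_any` and `sgb_all` and the `on_overlap` values are hypothetical names chosen for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+

def sgb_any(points, eps):
    """Distance-to-any grouping (sketch): a tuple joins a group if it is
    within eps of at least one member, so groups are the connected
    components of the eps-neighbour graph. Naive O(n^2) union-find."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())

def sgb_all(points, eps, on_overlap="join_any"):
    """Clique (distance-to-all) grouping (sketch): a tuple may join a group
    only if it is within eps of every member. When it qualifies for more
    than one group, on_overlap picks one of the three semantics from the
    abstract: 'eliminate', 'join_any', or 'new_group' (assumed names)."""
    if on_overlap not in ("eliminate", "join_any", "new_group"):
        raise ValueError(f"unknown overlap semantics: {on_overlap}")
    groups = []
    for p in points:  # result depends on input order in this sketch
        fits = [g for g in groups if all(dist(p, q) <= eps for q in g)]
        if not fits:
            groups.append([p])      # no candidate group: start a new one
        elif len(fits) == 1 or on_overlap == "join_any":
            fits[0].append(p)       # unique candidate, or pick any one
        elif on_overlap == "new_group":
            groups.append([p])      # overlapping tuple gets its own group
        # 'eliminate': the overlapping tuple is dropped
    return groups
```

For example, with the points `(0, 0)`, `(1, 0)`, `(2, 0)` and `eps = 1`, `sgb_any` returns a single group because each point is within distance 1 of a neighbour, while `sgb_all` splits off `(2, 0)` because it is at distance 2 from `(0, 0)`. If the middle point arrives last, it qualifies for both existing groups, and the three overlap semantics give three different results.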


Bibliographic details

Source
Knowledge and Data Engineering, IEEE Transactions on | 2016, Issue 2 | pp. 510-523 | 14 pages
Authors
Tang Mingjie; Tahboub Ruby Y.; Aref Walid G.; Atallah Mikhail J.; Malluhi Qutaibah M.; Ouzzani Mourad; Silva Yasin N.
Author affiliation

Department of Computer Science, Purdue University, Indiana, IN

Original format: PDF
Language: English
Keywords
SQL operators; similarity query; multidimensional data; query processing; relational database

