Dima: A Distributed In-Memory Similarity-Based Query Processing System

Abstract

Data analysts in industry spend more than 80% of their time on data cleaning and integration during the data analytics process, owing to data errors and inconsistencies. This calls for effective query processing techniques that tolerate such errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations, i.e., similarity search and similarity join. Dima extends the SQL programming interface so that users can easily invoke these two operations in their data analysis jobs. To avoid expensive data transformation in a distributed environment, we design selectable signatures, where two records approximately match if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support efficient similarity search and join. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration, and query recommendation.
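
The abstract's idea that two records can match only if they share a signature can be illustrated with a small single-machine sketch. It does not reproduce Dima's selectable signatures, its global and local indexes, or its Spark integration. As a stand-in it uses classic prefix filtering for Jaccard similarity, and every name in it (`prefix_signatures`, `similarity_join`, the sample records) is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prefix_signatures(tokens, threshold, order):
    """Return the prefix of `tokens` under a global token order.

    Prefix filtering (a stand-in here, not Dima's signature scheme):
    if two sets have Jaccard similarity >= threshold, their prefixes of
    length n - ceil(threshold * n) + 1 share at least one token.
    """
    ranked = sorted(tokens, key=lambda tok: order[tok])
    n = len(ranked)
    # The small epsilon guards against floating-point overshoot in ceil.
    prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
    return ranked[:prefix_len]


def similarity_join(records, threshold):
    """Self-join: all pairs of record ids with Jaccard >= threshold."""
    # Global order: rare tokens first, ties broken alphabetically.
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in tokens:
            freq[tok] += 1
    order = {tok: (count, tok) for tok, count in freq.items()}

    # Inverted index from signature to the records that carry it.
    index = defaultdict(list)
    for rid, tokens in records.items():
        for sig in prefix_signatures(tokens, threshold, order):
            index[sig].append(rid)

    # Only records sharing a signature become candidate pairs.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verify each candidate with the exact similarity function.
    return sorted(
        (a, b)
        for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    )
```

Calling `similarity_join` on a dictionary that maps record ids to token sets returns the matching id pairs. Unlike the system described in the abstract, this sketch makes no attempt to partition data or balance workload across machines.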

Bibliographic details

Source
International Conference on Very Large Data Bases, 2017, pp. 1925-1928 (4 pages)
Authors
Ji Sun; Zeyuan Shang; Guoliang Li; Dong Deng; Zhifeng Bao
Original format: PDF

