Conference: International Conference on Very Large Data Bases

Dima: A Distributed In-Memory Similarity-Based Query Processing System

Abstract

Data analysts in industry spend more than 80% of their time on data cleaning and integration over the whole data analytics process because of data errors and inconsistencies. This calls for effective query processing techniques that tolerate such errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations, i.e., similarity search and similarity join. Dima extends the SQL programming interface so that users can easily invoke these two operations in their data analysis jobs. To avoid expensive data transformation in a distributed environment, we design selectable signatures, where two records approximately match if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support efficient similarity search and join. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration, and query recommendation.
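
The signature idea in the abstract can be illustrated with a small single-machine sketch. It is not Dima's selectable-signature scheme, its global or local indexes, or its Spark integration, none of which the abstract specifies in detail. As a stand-in, the sketch uses the well-known prefix-filtering technique for Jaccard similarity: each record gets a few tokens as signatures, only records that share a signature become candidate pairs, and candidates are then verified exactly. All function names (`prefix_signatures`, `similarity_join`, `brute_force_join`) and the sample records are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix_signatures(tokens, threshold, order):
    """Return the prefix tokens of a record under a fixed global token order.

    Assumption: prefix filtering stands in for Dima's signatures here.
    For Jaccard threshold t and a record with n distinct tokens, two records
    with similarity >= t must share a token among their first
    n - ceil(t * n) + 1 tokens in the global order.
    """
    toks = sorted(set(tokens), key=lambda tok: order[tok])
    n = len(toks)
    if n == 0:
        return []
    # The small epsilon guards against float round-up such as 0.7 * 10.
    required_overlap = math.ceil(threshold * n - 1e-9)
    return toks[: n - required_overlap + 1]


def similarity_join(records, threshold):
    """Self-join: index pairs (i, j), i < j, with Jaccard >= threshold."""
    # Global order: rare tokens first, ties broken by the token itself.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1
    order = {tok: (count, tok) for tok, count in freq.items()}

    index = defaultdict(list)  # signature -> ids of records seen so far
    candidates = set()
    for rid, rec in enumerate(records):
        for sig in prefix_signatures(rec, threshold, order):
            for other in index[sig]:
                candidates.add((other, rid))  # filter step
            index[sig].append(rid)

    # Verify step: keep only candidates that truly meet the threshold.
    return sorted(
        (i, j) for i, j in candidates
        if jaccard(records[i], records[j]) >= threshold
    )


def brute_force_join(records, threshold):
    """Reference result that compares every pair, for checking the filter."""
    return sorted(
        (i, j) for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= threshold
    )


if __name__ == "__main__":
    sample = [
        ["vldb", "2017", "dima", "similarity"],
        ["vldb", "2017", "dima", "similarity", "join"],
        ["spark", "sql"],
        ["spark", "sql", "engine"],
    ]
    print(similarity_join(sample, 0.6))  # [(0, 1), (2, 3)]
```

In a distributed setting, the signature would plausibly also serve as the partitioning key so that matching records land on the same worker, which is where balancing the signature workload becomes important; the sketch leaves that part out.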