基于多核的细粒度并行的集合相似连接

荣垂田; 李银银; 冯林静; 汪剑鸣

摘要

相似连接是指在给定的两个数据集中,根据给定的相似性度量函数来计算数据之间的相似度,并找出所有相似度不小于给定阈值的数据对的操作.相似连接作为一个基本的操作,被广泛地应用于各种领域.随着网络和移动应用等信息技术的不断发展,数据呈现爆炸式增长,海量数据的分析需要强大的计算能力.为了满足不断增长的计算需求,提高计算效率和计算性能,计算机的体系结构也不断升级,出现了多核多处理器架构.为了提高相似连接的效率和计算资源的利用率,文中提出了基于多核的并行相似连接方法.相似连接操作与传统关系数据库中结构化数据之间的连接操作不同,相似连接处理的是异构数据,每一条数据处理的代价与其结构有关.为了实现多核处理器之间的任务均衡,文中提出了适合相似连接操作特点的数据长度均衡的数据划分方法.根据相似连接操作遵循Filter-Refine两阶段框架的特点,结合现代计算机体系结构的多核特性,提出了基于共享索引的任务分解方法和基于独立索引的任务分解方法.文中利用提出的数据划分方法和任务分解方法,实现了基于多核的并行化相似连接算法,包括自连接和R-S连接.文中对两种不同的实现方式的时间代价进行了分析,其中包括索引更新、索引扫描以及集合交运算的代价,从理论分析的角度证明了数据长度均衡划分和独立索引的实现方式在执行效率上具有优势.文中实验部分利用不同的数据集在多核多处理器平台上对并行相似连接的不同实现方式的执行效率和可扩展性进行了验证.实验结果证明,文中提出的数据长度均衡的数据划分方法以及基于独立索引的任务分解方法可以有效地提高并行相似连接的执行效率,在16核平台上使用16个线程在DBLP数据集上执行并行的相似自连接以及在CiteSeer和DBLP两个数据集上执行并行的R-S连接都可以在1秒内完成.%Similarity join is a primitive operation that is widely used in many applications.It is used to find all similarity pairs whose similarity are not less than the given threshold from two datasets using the given similarity functions.As the development of information technologies,such as Internet and mobile applications et.al,the scale of the data set is increasing vastly.In order to analyze large scale data sets,it is needed to improve the computation capacity of computers.So,the architecture of computer with multi cores and multi processors has been invented to satisfy the increasing need of computation and to improve the computer's efficiency and performance.In order to improve the efficiency of similarity join and the utilization of computation resource,we proposed a solution for parallel similarity join based on multicores.In the relational database systems,the join operations are performed between structured data sets.While,the similarity join performs join operations between heterogeneous data sets.There is a big difference between these two different operations.The cost of joining two records in relational database is the same in one query,as it performs joins on the same type of attributes.However,the cost of similarity join operation is related with the length of the attributes that to be joined.In order to improve the efficiency of similarity joins and utilize the computation capacity of multicores efficiently,we proposed to implement parallel similarity joins on multicores system using multi-threads technology.While,workload balancing is the assurance to get high performance for parallel programing.In order to achieve workload balancing among multicores,we proposed the even-length data partition strategy.As similarity join algorithms typically adopt a two-step filter-and-refine approach,we proposed two different task decomposition methods,shared-index based method and independentindex based method,regarding to the features of modern computer architecture.Based on our proposed data partition strategy and task decomposition methods,we implemented the parallel similarity joins based on multicores.We also analyzed the time cost of the two different implementations,evenquantity data partition with shared-index and even-length data partition with independent-index.In the time cost analysis,we considered the cost of index update,index scan and sets intersection.Based on our analysis,we proved that the implementation applying even-length data partition and independent-index can get better performance.In the experiment section,we conducted extensive experiments on four different implementations of parallel similarity joins using two different datasets.The experimental results showed that our proposed implementation applying even-length data partition strategy and independent-index can fully utilize the parallel computing features of multicores and get the highest performance.The efficiency of similarity joins can be improved vastly.The parallel self similarity join on DBLP data set and the parallel R-S similarity join between two data sets of CiteSeer and DBLP can be completed in a second using 16 threads on 16 cores system.We also compared our proposed methods with the state-of-art methods on the two datasets under the same experiment setups.The experimental results showed that our proposed methods can outperform the other methods by a wide margin.

基于多核的细粒度并行的集合相似连接

摘要

著录项

相似文献

相关主题

期刊订阅