Conference: Bangalore Annual Compute Conference
Performance evaluation of similarity join for real time information integration

Abstract

Approximate join processing plays a key role in many application areas, such as data cleansing, data integration, text mining, and bioinformatics. There has been considerable research interest in approximate join processing based on the edit distance metric. Approximate join algorithms generally use a variety of q-gram-based filtering techniques to improve scalability. The primary approach in the literature exploits methods available inside a particular database language. However, this is impractical when heterogeneous data is spread across multiple databases. A popular alternative is to compare all possible pairings of strings from the two inputs directly. However, such algorithms do not scale well to very large databases, even after applying q-gram filters. Here we design a novel, stand-alone filtering technique, essentially a modification of the HashJoin algorithm, to improve the scalability of similarity join processing algorithms. We implement the algorithm and conduct a number of experiments to study the performance of the system. The presented algorithm is also integrated with a real-life data federation solution called Infosys Gradient. The paper presents performance results on a real-life test bed.
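The abstract does not describe the filtering technique in detail, so the following is only a minimal sketch of the general idea it names: a hash-based build and probe phase combined with q-gram filtering, followed by edit distance verification. It is not the paper's algorithm. The function names (`qgrams`, `edit_distance`, `similarity_join`), the padding characters, and the parameters `k` (edit distance threshold) and `q` (q-gram length) are assumptions made for illustration. The count bound used is the standard q-gram count filter for padded strings.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the q-grams of s after padding it with q-1 marker characters."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_join(left, right, k=1, q=2):
    """Return pairs (l, r) with edit_distance(l, r) <= k (illustrative only)."""
    # Build phase: hash every q-gram of the right input to the ids containing it.
    index = defaultdict(list)
    for rid, r in enumerate(right):
        for g in qgrams(r, q):
            index[g].append(rid)

    # Very short strings can match without sharing any q-gram, so they are
    # kept aside and always considered as candidates for short probe strings.
    short_limit = (k - 1) * q + 1
    short_right = [rid for rid, r in enumerate(right) if len(r) <= short_limit]

    results = []
    for l in left:
        # Probe phase: count q-gram hits per right string. Repeated q-grams
        # may be overcounted, which weakens the filter but never drops a match.
        counts = defaultdict(int)
        for g in qgrams(l, q):
            for rid in index.get(g, ()):
                counts[rid] += 1

        candidates = set(counts)
        if len(l) <= short_limit:
            candidates.update(short_right)

        for rid in sorted(candidates):
            r = right[rid]
            # Length filter: lengths of matching strings differ by at most k.
            if abs(len(l) - len(r)) > k:
                continue
            # Count filter: matching strings share at least this many q-grams.
            needed = max(len(l), len(r)) + q - 1 - k * q
            if counts.get(rid, 0) < needed:
                continue
            # Verification: only surviving candidates pay for the full distance.
            if edit_distance(l, r) <= k:
                results.append((l, r))
    return results


pairs = similarity_join(["john smith", "jane doe"], ["jon smith", "jane do", "alice"], k=1)
```

Because the index is an ordinary in-memory hash table, a sketch like this does not depend on any particular database language, which is the property the abstract emphasizes for heterogeneous sources. A real system would also need to handle inputs that do not fit in memory, which this example ignores.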
