Journal: SIGMOD Record

State-of-the-art in String Similarity Search and Join

Abstract

String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, and bioinformatics. A plethora of methods has been developed over the last decades. Obtaining an overview of the state of the art in this field is difficult: results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report on the results of what is probably the largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, held as a workshop at EDBT/ICDT 2013. Teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing size and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects, and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other: the three best teams are rooted in computational linguistics, databases, and bioinformatics, respectively.
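To make the two problem names concrete, here is a minimal brute-force sketch in Python. It assumes that "k-approximate" means an edit (Levenshtein) distance of at most k, which the abstract does not state explicitly, and it treats the join as a self-join over one collection. The function names `within_k`, `k_approximate_search`, and `k_approximate_join` are hypothetical and are not taken from the paper or from any competing program; the benchmarked systems presumably rely on indexes and filtering rather than this quadratic baseline.

```python
from typing import List, Tuple


def within_k(a: str, b: str, k: int) -> bool:
    """Return True if the edit distance between a and b is at most k."""
    # Length filter: the distance is at least the length difference.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        # Row minima never decrease, so the pair can be rejected early.
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k


def k_approximate_search(query: str, strings: List[str], k: int) -> List[str]:
    """Assumed search semantics: all strings within distance k of the query."""
    return [s for s in strings if within_k(query, s, k)]


def k_approximate_join(strings: List[str], k: int) -> List[Tuple[str, str]]:
    """Assumed self-join semantics: all unordered pairs within distance k."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if within_k(strings[i], strings[j], k):
                pairs.append((strings[i], strings[j]))
    return pairs


if __name__ == "__main__":
    cities = ["Berlin", "Bern", "Berlyn", "Paris"]
    print(k_approximate_search("Berlin", cities, 1))
    print(k_approximate_join(cities, 1))
```

The search scans every string per query and the join compares every pair, so the sketch only pins down the expected results on small inputs; it says nothing about how the evaluated programs achieve their running times.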
