State-of-the-art in String Similarity Search and Join

Sebastian Wandelt; Shashwat Mishra; Enrico Siragusa; Dong Deng; Petar Mitankin; Alexander Tiskin; Stefan Gerdjikov; Manish Patil; Wei Wang; Jiaying Wang; Ulf Leser


Abstract

String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, and bioinformatics. A plethora of methods have been developed over the last decades. Obtaining an overview of the state of the art in this field is difficult: results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report on the results of what is probably the largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, held as a workshop at EDBT/ICDT 2013. Various teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects, and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other, with the three best teams rooted in computational linguistics, databases, and bioinformatics, respectively.
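The two problems named in the abstract can be illustrated with a brute-force baseline. This is only a minimal sketch, not any competing team's method. It assumes that "k-approximate" means Levenshtein edit distance of at most k, which the abstract itself does not spell out. The function names `within_k`, `k_approximate_search`, and `k_approximate_join` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def within_k(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k.

    Assumption: edit distance with unit-cost insert, delete, substitute.
    """
    # A length difference larger than k already needs more than k edits.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        # Row minima never decrease, so the final distance must exceed k.
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k


def k_approximate_search(query: str, collection: List[str], k: int) -> List[str]:
    """Return every string in the collection within distance k of the query."""
    return [s for s in collection if within_k(query, s, k)]


def k_approximate_join(collection: List[str], k: int) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose strings are within distance k."""
    pairs = []
    for i in range(len(collection)):
        for j in range(i + 1, len(collection)):
            if within_k(collection[i], collection[j], k):
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    names = ["berlin", "berlim", "bern", "sofia"]
    print(k_approximate_search("berlin", names, 1))
    print(k_approximate_join(names, 2))
```

The join compares all pairs and is therefore quadratic in the collection size. A benchmark on data sets of increasing size is useful exactly because index-based methods aim to avoid this cost.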


Bibliographic details

Source: SIGMOD Record, 2014, Issue 1, pp. 64-76 (13 pages)
Author affiliations

Knowledge Management in Bioinformatics, HU Berlin, Berlin, Germany;

Special Interest Group in Data, IIT Kanpur, Kanpur, India;

Algorithmic Bioinformatics, FU Berlin, Berlin, Germany;

Tsinghua University, Beijing, China;

IICT Bulgarian Academy of Sciences, FMI Sofia University, Sofia, Bulgaria;

Department of Computer Science, University of Warwick, United Kingdom;

FMI Sofia University, Sofia, Bulgaria;

Louisiana State University, Louisiana, USA;

University of New South Wales, New South Wales, Australia;

Northeastern University Shenyang, China;

Knowledge Management in Bioinformatics, HU Berlin, Berlin, Germany;

Original format: PDF
Language: English
Keywords
String search; String join; Scalability; Comparison;


