Probabilistic String Similarity Joins


Abstract

Edit-distance-based string similarity join is a fundamental operator in string databases. Increasingly, applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted to processing (deterministic) string joins and to managing probabilistic data, modeling and processing probabilistic strings is still largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models that capture the fuzziness of string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings exists only at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques for both models that can be easily implemented in existing relational database engines. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
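
To make the two models and the similarity measure concrete, the following is a minimal brute-force sketch in Python. It is an illustration only, not the pruning techniques of the paper. It assumes that the EED is the probability-weighted average of the ordinary edit distance over all pairs of possible instances, and that the positions of a character-level string are independent. The function names and the data layout (lists of `(string, probability)` pairs, and one `char -> probability` dict per position) are hypothetical choices made for this example.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def expand_char_level(positions):
    """Expand a character-level string into string-level (string, prob) pairs.

    `positions` holds one dict per position mapping each candidate character
    to its probability. Positions are assumed independent (an assumption of
    this sketch), so an instance's probability is the product of its choices.
    """
    worlds = []
    for combo in product(*(pos.items() for pos in positions)):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((text, prob))
    return worlds


def expected_edit_distance(s_worlds, t_worlds):
    """Brute-force EED: weight every pairwise edit distance by its probability."""
    return sum(p * q * edit_distance(s, t)
               for s, p in s_worlds
               for t, q in t_worlds)
```

For example, `expand_char_level([{"c": 1.0}, {"a": 0.5, "u": 0.5}, {"t": 1.0}])` yields the two instances "cat" and "cut", each with probability 0.5, and comparing them against the deterministic string `[("cat", 1.0)]` gives an expected edit distance of 0.5. The expansion grows exponentially with the number of uncertain positions and the EED loop is quadratic in the number of instances, which is consistent with the abstract's remark that computing the EED directly is expensive and that pruning is needed.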


Bibliographic record

Source: ACM SIGMOD International Conference on Management of Data | 2010 | 12 pages
Authors: Jeffrey Jestes, Feifei Li, Zhepeng Yan
Original format: PDF
Chinese Library Classification: Programming, software engineering
Keywords: probabilistic strings; approximate string queries; string joins

Similar literature

1. String similarity join with different similarity thresholds based on novel indexing techniques [J]. Chuitian Rong, Yasin N. Silva, Chunqing Li. Frontiers of Computer Science in China. 2017, No. 2
2. LS-Join: Local Similarity Join on String Collections [J]. Jiaying Wang, Xiaochun Yang, Bin Wang. IEEE Transactions on Knowledge and Data Engineering. 2017, No. 9
3. Para-Join: an efficient parallel method for string similarity join [J]. Cairong Yan, Jian Wang, Bin Zhu. International Journal of High Performance Computing and Networking. 2017, No. 4/5
4. Probabilistic String Similarity Joins [C]. Jeffrey Jestes, Feifei Li, Zhepeng Yan. ACM SIGMOD International Conference on Management of Data (SIGMOD 2010). 2010
5. String Similarity Joins and Search Under Edit Distance [D]. Haoyu Zhang. 2020
6. Efficient string similarity join in multi-core and distributed systems [O]. Cairong Yan, Xue Zhao, Qinglong Zhang. 2012
7. Probabilistic string similarity joins [O]. J. Jestes, F. Li, Z. Yan. 2010
