Probabilistic String Similarity Joins

Abstract

Edit-distance-based string similarity join is a fundamental operator in string databases. Increasingly, applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted to processing (deterministic) string joins and to managing probabilistic data, modeling and processing probabilistic strings is still largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models that capture the fuzziness of string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in a string exists only at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques for both models that can be easily implemented in existing relational database engines. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
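To make the two models and the similarity measure concrete, the following is a minimal brute-force sketch, not the paper's method. It assumes that a string-level probabilistic string is a dictionary mapping each possible string to its probability, that a character-level string is a list of per-position character distributions that are independent of each other, and that the two strings being compared are independent. The function names (`edit_distance`, `expected_edit_distance`, `expand_character_level`) are hypothetical and chosen only for illustration. The sketch enumerates every pair of possible strings, which is exactly the cost that pruning techniques aim to avoid.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def expected_edit_distance(s: dict, t: dict) -> float:
    """Expected edit distance of two string-level probabilistic strings.

    Assumption: s and t are independent, so the probability of a pair of
    choices (a, b) is the product of the individual probabilities.
    """
    return sum(p * q * edit_distance(a, b)
               for a, p in s.items()
               for b, q in t.items())


def expand_character_level(positions: list) -> dict:
    """Expand a character-level string into its string-level form.

    Assumption: positions are mutually independent; each element of
    `positions` maps a character to its probability at that position.
    """
    worlds = {}
    for combo in product(*(pos.items() for pos in positions)):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds[text] = worlds.get(text, 0.0) + prob
    return worlds
```

For example, a character-level string that is certain at positions 1 and 3 and uncertain at position 2, `[{"c": 1.0}, {"a": 0.5, "u": 0.5}, {"t": 1.0}]`, expands to the string-level form `{"cat": 0.5, "cut": 0.5}`. Its expected edit distance to the deterministic string `{"cat": 1.0}` is 0.5 under these assumptions. The expansion grows exponentially with the number of uncertain positions, which illustrates why the character-level representation is the more succinct of the two.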