首页> 外文会议>Scientific and statistical database management >Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data
【24h】

Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

机译:用于基因组数据相似性搜索和相似性联接的前缀树索引

获取原文
获取原文并翻译 | 示例

摘要

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. Especially DNA sequencing produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix tree based indexing algorithm supporting approximate search and approimate joins. Our tool supports Hamming and edit distance as similarity measure and is available as C++ library, as Unix command line tool, and as cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our tool on several collections of long strings containing up to 5,000,000 entries of length up to 3,500. We compare its performance to agrep, nrgrep, and user-defined functions inside a relational database. Our experiments reveal that PETER is faster by orders of magnitudes compared to the command-line tools. Compared to RDBMS, it computes similarity joins in minutes for which UDFs did not finish within a day and outperforms the built-in join methods even in the exact case.
机译:字符串上的相似性搜索和相似性连接对于诸如重复检测,错误检测,数据清理或生物序列比较之类的应用非常重要。尤其是DNA测序会产生大量错误的字符串,需要对其进行搜索,比较和合并。但是,当前的RDBMS仅以非常有限且效率低下的形式提供相似性操作,无法适应生命科学项目中产生的数据量。我们提出PETER,这是一种基于前缀树的索引算法,支持近似搜索和近似联接。我们的工具支持汉明和编辑距离作为相似性度量,并且可以作为C ++库,作为Unix命令行工具以及作为商业数据库的墨盒使用。它结合了有效的压缩前缀树实现和先进的预过滤技术,这些技术可以尽早排除许多候选字符串。所实现的加速是惊人的,特别是对于带有小字母的DNA。我们在几个长字符串集合中评估我们的工具,这些字符串包含最多5,000,000个条目,长度最大为3500。我们将其性能与关系数据库中的agrep,nrgrep和用户定义的函数进行比较。我们的实验表明,PETER比命令行工具要快几个数量级。与RDBMS相比,它可以在几分钟内计算出相似的联接,而UDF在一天之内没有完成,即使在精确的情况下,其性能也优于内置的联接方法。

著录项

  • 来源
  • 会议地点 Heidelberg(DE);Heidelberg(DE)
  • 作者单位

    Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

    Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

    Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

    Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

  • 会议组织
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类 TP311.13;
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号