首页> 外文会议>Scientific and statistical database management >Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

【24h】

Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

机译：用于基因组数据相似性搜索和相似性联接的前缀树索引

获取原文

获取原文并翻译 | 示例

页面导航

摘要
著录项
相似文献
相关主题

摘要

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. Especially DNA sequencing produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix tree based indexing algorithm supporting approximate search and approimate joins. Our tool supports Hamming and edit distance as similarity measure and is available as C++ library, as Unix command line tool, and as cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our tool on several collections of long strings containing up to 5,000,000 entries of length up to 3,500. We compare its performance to agrep, nrgrep, and user-defined functions inside a relational database. Our experiments reveal that PETER is faster by orders of magnitudes compared to the command-line tools. Compared to RDBMS, it computes similarity joins in minutes for which UDFs did not finish within a day and outperforms the built-in join methods even in the exact case.

机译：字符串上的相似性搜索和相似性连接对于诸如重复检测，错误检测，数据清理或生物序列比较之类的应用非常重要。尤其是DNA测序会产生大量错误的字符串，需要对其进行搜索，比较和合并。但是，当前的RDBMS仅以非常有限且效率低下的形式提供相似性操作，无法适应生命科学项目中产生的数据量。我们提出PETER，这是一种基于前缀树的索引算法，支持近似搜索和近似联接。我们的工具支持汉明和编辑距离作为相似性度量，并且可以作为C ++库，作为Unix命令行工具以及作为商业数据库的墨盒使用。它结合了有效的压缩前缀树实现和先进的预过滤技术，这些技术可以尽早排除许多候选字符串。所实现的加速是惊人的，特别是对于带有小字母的DNA。我们在几个长字符串集合中评估我们的工具，这些字符串包含最多5,000,000个条目，长度最大为3500。我们将其性能与关系数据库中的agrep，nrgrep和用户定义的函数进行比较。我们的实验表明，PETER比命令行工具要快几个数量级。与RDBMS相比，它可以在几分钟内计算出相似的联接，而UDF在一天之内没有完成，即使在精确的情况下，其性能也优于内置的联接方法。

著录项

来源
《Scientific and statistical database management 》|2010年|p.519-536|共18页
会议地点 Heidelberg(DE);Heidelberg(DE)
作者
Astrid Rheinlaender; Martin Knobloch; Nicky Hochmuth; Ulf Leser;
展开▼
作者单位

Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

Humboldt-Universitat zu Berlin Department of Computer Science Berlin, Germany;

展开▼
会议组织
原文格式 PDF
正文语种 eng
中图分类 TP311.13;
关键词

相似文献

外文文献
中文文献
专利

1. Efficient Metric Indexing for Similarity Search and Similarity Joins [J] . Lu Chen, Yunjun Gao, Xinhan Li, IEEE Transactions on Knowledge and Data Engineering . 2017 ,第3期

机译：用于相似搜索和相似联接的有效度量索引
2. Efficient Metric Indexing for Similarity Search and Similarity Joins [J] . Lu Chen, Yunjun Gao, Xinhan Li, Theoretical and Experimental Plant Physiology . 2017 ,第3期

机译：相似性搜索和相似性的高效度量索引
3. Tree Based Fast Similarity Query Search Indexing on Outsourced Cloud Data Streams [J] . Balasubramanian Balamurugan, Durai Kamalraj, Sathyanarayanan Jegadeeswari, The international arab journal of information technology . 2019 ,第5期

机译：基于树的快速相似性查询在外包云数据流上搜索索引
4. Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data [C] . Astrid Rheinlander, Martin Knobloch, Nicky Hochmuth, International Conference on Scientific and Statistical Database Management . 2010

机译：相似性搜索的前缀树索引和相似性在基因组数据上加入
5. Mining, indexing and similarity search in large graph data sets. [D] . Yan, Xifeng. 2006

机译：大型图数据集中的挖掘，索引和相似性搜索。
6. Quadrant-Based Minimum Bounding Rectangle-Tree Indexing Method for Similarity Queries over Big Spatial Data in HBase [O] . Bumjoon Jo, Sungwon Jung 2018

机译：HBase中大空间数据相似性查询的基于象限的最小边界矩形树索引方法
7. Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data [O] . Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, 2013

机译：用于基因组数据相似性搜索和相似性加入的前缀树索引

Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

摘要

著录项

相似文献

相关主题

期刊订阅