A Top-Down Method for Mining Most Specific Frequent Patterns in Biological Sequence Data

Abstract

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase in the number of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge on the relationships between the symbols of the underlying alphabets (in particular, amino acids) of the sequences has been acquired, which can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
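
The two properties named in the abstract (specific patterns before general ones, and only minimal generalizations of infrequent candidates) can be illustrated with a small Python sketch. This is not the ToMMS algorithm from the paper, whose details are not given here. The concept graph, the contiguous-window matching rule, the single seed pattern, and all function names (`parent`, `covers`, `matches`, `support`, `minimal_generalizations`, `top_down`) are assumptions made for illustration only.

```python
from collections import deque

ROOT = "any"  # hypothetical wildcard concept that covers every residue

# Toy concept graph (illustrative assumption): symbol -> direct parent concept.
PARENT = {
    "A": "hydrophobic", "V": "hydrophobic", "L": "hydrophobic", "I": "hydrophobic",
    "D": "charged", "E": "charged", "K": "charged", "R": "charged",
}

def parent(symbol):
    """Direct parent of a symbol; unlisted symbols hang directly under ROOT."""
    return None if symbol == ROOT else PARENT.get(symbol, ROOT)

def covers(concept, symbol):
    """True if `concept` is `symbol` itself or one of its ancestors."""
    while symbol is not None:
        if symbol == concept:
            return True
        symbol = parent(symbol)
    return False

def matches(pattern, sequence):
    """True if `pattern` covers some contiguous window of `sequence`."""
    n = len(pattern)
    return any(
        all(covers(c, s) for c, s in zip(pattern, sequence[i:i + n]))
        for i in range(len(sequence) - n + 1)
    )

def support(pattern, database):
    """Number of database sequences that contain a match of `pattern`."""
    return sum(matches(pattern, seq) for seq in database)

def minimal_generalizations(pattern):
    """One-step generalizations: lift one symbol to its parent,
    or drop an end symbol that is already the wildcard."""
    out = []
    for i, symbol in enumerate(pattern):
        up = parent(symbol)
        if up is not None:
            out.append(pattern[:i] + (up,) + pattern[i + 1:])
    if len(pattern) > 1:
        if pattern[0] == ROOT:
            out.append(pattern[1:])
        if pattern[-1] == ROOT:
            out.append(pattern[:-1])
    return out

def top_down(seed, database, min_support):
    """Breadth-first generalization of `seed`; a branch stops as soon as
    its pattern is frequent, so specific patterns are tested first."""
    frequent, seen, queue = [], {seed}, deque([seed])
    while queue:
        pattern = queue.popleft()
        if support(pattern, database) >= min_support:
            frequent.append(pattern)
            continue
        for g in minimal_generalizations(pattern):
            if g not in seen:
                seen.add(g)
                queue.append(g)
    # Drop results that are more general than another frequent result.
    return [p for p in frequent
            if not any(q != p and matches(p, q) for q in frequent)]

database = ["MAVK", "GLIR", "TVLE"]  # made-up sequences
print(top_down(("A", "V", "K"), database, min_support=3))
```

With the made-up database above, the seed `("A", "V", "K")` occurs in only one sequence, and the search generalizes it until it reaches `("hydrophobic", "hydrophobic", "charged")`, which all three sequences contain. Starting from a single seed and matching only contiguous windows are simplifications chosen to keep the sketch short.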

Bibliographic details

Source: International Conference on Data Mining, 2004, 12 pages
Authors: Martin Ester; Xiang Zhang
Original format: PDF
Chinese Library Classification: TP274.2-53
Keywords: Data mining algorithms; Sequence data; Efficiency; Data mining applications in bioinformatics

