Conference: International Conference on Data Mining

A Top-Down Method for Mining Most Specific Frequent Patterns in Biological Sequence Data


Abstract

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase in the number of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge has been acquired on the relationships between the symbols of the sequences' underlying alphabets (in particular, amino acids), and this knowledge can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered, and these are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
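The idea of generalizing infrequent candidates along a concept graph can be illustrated with a small Python sketch. This is not the ToMMS algorithm itself, whose details the abstract does not give. Everything here is an assumption made for illustration: the toy concept graph `PARENTS`, the choice to match patterns against contiguous windows, and the helper names `support`, `minimal_generalizations`, and `mine_from_seed`. The sketch also uses a final filtering step to keep only the most specific frequent patterns, rather than relying on any ordering guarantee.

```python
from collections import deque

# Hypothetical toy concept graph (an assumption, not taken from the paper):
# each symbol maps to its direct generalizations.
PARENTS = {
    "L": ["aliphatic"], "I": ["aliphatic"], "V": ["aliphatic"],
    "D": ["acidic"], "E": ["acidic"],
    "K": ["basic"], "R": ["basic"],
    "aliphatic": ["any"], "acidic": ["charged"], "basic": ["charged"],
    "charged": ["any"], "any": [],
}

def ancestors(symbol):
    """Return the symbol together with every concept that generalizes it."""
    seen, stack = {symbol}, [symbol]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def occurs(pattern, sequence):
    """True if some contiguous window of the sequence is an instance of the pattern."""
    n = len(pattern)
    return any(
        all(p in ancestors(s) for p, s in zip(pattern, sequence[i:i + n]))
        for i in range(len(sequence) - n + 1)
    )

def support(pattern, database):
    """Fraction of database sequences that contain the pattern."""
    return sum(occurs(pattern, seq) for seq in database) / len(database)

def minimal_generalizations(pattern):
    """Yield patterns obtained by lifting one symbol to one direct parent."""
    for i, symbol in enumerate(pattern):
        for parent in PARENTS.get(symbol, []):
            yield pattern[:i] + (parent,) + pattern[i + 1:]

def generalizes(p, q):
    """True if p is equal to q or more general than q, position by position."""
    return len(p) == len(q) and all(a in ancestors(b) for a, b in zip(p, q))

def mine_from_seed(seed, database, min_support):
    """Generalize an infrequent seed step by step until frequent patterns appear."""
    frequent, seen, queue = [], {seed}, deque([seed])
    while queue:
        pattern = queue.popleft()
        if support(pattern, database) >= min_support:
            frequent.append(pattern)  # frequent patterns are not generalized further
            continue
        for candidate in minimal_generalizations(pattern):
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    # Keep only patterns that do not generalize another frequent pattern.
    return [p for p in frequent
            if not any(q != p and generalizes(p, q) for q in frequent)]

database = ["LDK", "IEK", "VDR"]
result = mine_from_seed(("L", "D", "K"), database, min_support=1.0)
print(result)
```

With this toy database, the fully specific seed `("L", "D", "K")` occurs in only one of the three sequences, so the search lifts symbols one level at a time until it reaches a pattern that every sequence matches.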
