A top-down approach for mining most specific frequent patterns in biological sequence data.

Abstract

The emergence of automated high-throughput sequencing technologies has greatly increased the number of DNA and protein sequences available in public databases. A promising approach to mining such biological sequence data is to mine frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge has been acquired on the relationships between the symbols of the sequences' underlying alphabets (in particular, amino acids), and this knowledge can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered that are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
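
To make the idea of top-down mining with a concept graph concrete, here is a minimal Python sketch. It is not the ToMMS algorithm from the thesis: the toy concept graph `PARENT`, the restriction to contiguous patterns of a fixed length `k`, and all function names are assumptions made for illustration. The sketch starts from the most specific candidates (the windows that actually occur in the data), generalizes an infrequent candidate by one concept-graph step at a time, and stops generalizing once a candidate is frequent. It does not reproduce ToMMS's ordering guarantee, so a final filter discards any frequent pattern that generalizes another frequent pattern.

```python
from collections import deque

TOP = "*"  # wildcard: the most general concept

# Hypothetical toy concept graph (child -> parent). Symbols without an
# entry are assumed to generalize directly to the wildcard.
PARENT = {
    "I": "aliphatic", "L": "aliphatic", "V": "aliphatic",
    "F": "aromatic", "W": "aromatic", "Y": "aromatic",
    "aliphatic": "hydrophobic", "aromatic": "hydrophobic",
    "hydrophobic": TOP,
}

def ancestors(symbol):
    """Return the symbol followed by all of its generalizations up to TOP."""
    chain = [symbol]
    while chain[-1] != TOP:
        chain.append(PARENT.get(chain[-1], TOP))
    return chain

def subsumes(general, specific):
    """True if every position of `general` equals or generalizes `specific`."""
    return len(general) == len(specific) and all(
        g in ancestors(s) for g, s in zip(general, specific))

def support(pattern, sequences):
    """Count the sequences that contain a contiguous window matching the pattern."""
    k = len(pattern)
    return sum(
        any(subsumes(pattern, tuple(seq[i:i + k]))
            for i in range(len(seq) - k + 1))
        for seq in sequences)

def minimal_generalizations(pattern):
    """Yield patterns obtained by lifting exactly one position one level up."""
    for i, symbol in enumerate(pattern):
        if symbol != TOP:
            yield pattern[:i] + (PARENT.get(symbol, TOP),) + pattern[i + 1:]

def most_specific_frequent(sequences, k, min_support):
    """Top-down search for the most specific frequent length-k patterns."""
    start = {tuple(seq[i:i + k])
             for seq in sequences for i in range(len(seq) - k + 1)}
    queue, seen, frequent = deque(start), set(start), set()
    while queue:
        candidate = queue.popleft()
        if support(candidate, sequences) >= min_support:
            frequent.add(candidate)  # frequent: no need to generalize further
            continue
        for general in minimal_generalizations(candidate):
            if general not in seen:
                seen.add(general)
                queue.append(general)
    # Keep only patterns that do not strictly generalize another frequent one.
    return {p for p in frequent
            if not any(q != p and subsumes(p, q) for q in frequent)}
```

For example, `most_specific_frequent(["KID", "KLD", "KVF"], 2, 2)` finds no exact two-letter window shared by two sequences, but lifting `I`, `L` and `V` to the assumed concept `aliphatic` yields the patterns `("K", "aliphatic")` and `("aliphatic", "D")`.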

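Bibliographic record

  • Author

    Zhang, Xiang

  • Author affiliation

    Simon Fraser University (Canada)

  • Degree-granting institution: Simon Fraser University (Canada)
  • Subject: Computer Science; Biology, General
  • Degree: M.Sc.
  • Year: 2004
  • Pages: 58 p.
  • Total pages: 58
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Automation technology, computer technology; General biology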
