Foreign conference: Pattern recognition in bioinformatics

Counting Patterns in Degenerated Sequences

Abstract

Biological sequences such as DNA or proteins are always obtained through a sequencing process that can introduce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet in which some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, a question naturally arises: how should degenerated positions be handled? Since most positions (usually 99%) are not degenerated, discarding the degenerated positions in order to get an observation is considered harmless, but the exact consequences of this practice are unclear. In this paper, we introduce a rigorous method that takes the uncertainty of sequencing into account for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence, and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Although only 1% of the positions in this dataset are degenerated, we show that ignoring these positions can lead to erroneous observations, which further demonstrates the value of our approach.
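
To make the Forward-Backward step more concrete, here is a minimal Python sketch of how the marginal distribution of each position could be computed under IUPAC constraints. It is not the authors' implementation. It assumes a first-order homogeneous Markov model over the letters A, C, G, T with known parameters, and the names `posterior_marginals`, `start`, `trans`, and `sticky` are hypothetical, introduced only for illustration.

```python
# Illustrative sketch only: a first-order Markov chain over ACGT is assumed,
# and every name below is hypothetical (not taken from the paper).
LETTERS = "ACGT"

# IUPAC nucleotide codes mapped to the set of letters each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def _normalize(vec):
    """Rescale a non-negative vector so that it sums to 1."""
    total = sum(vec)
    if total == 0.0:
        raise ValueError("sequence has zero probability under the model")
    return [v / total for v in vec]


def posterior_marginals(seq, start, trans):
    """Return P(X_i = letter | IUPAC constraints) for each position i.

    seq   -- non-empty string of IUPAC codes
    start -- initial distribution over LETTERS (list of 4 floats)
    trans -- 4x4 transition matrix, trans[a][b] = P(next = b | current = a)
    """
    k, n = len(LETTERS), len(seq)
    # allowed[i][a] is 1.0 if letter a is compatible with the code at position i.
    allowed = [[1.0 if LETTERS[a] in IUPAC[s] else 0.0 for a in range(k)]
               for s in seq]

    # Forward pass (rescaled at each step to avoid underflow).
    fwd = [_normalize([start[a] * allowed[0][a] for a in range(k)])]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append(_normalize([
            allowed[i][b] * sum(prev[a] * trans[a][b] for a in range(k))
            for b in range(k)
        ]))

    # Backward pass (also rescaled).
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = _normalize([
            sum(trans[a][b] * allowed[i + 1][b] * bwd[i + 1][b] for b in range(k))
            for a in range(k)
        ])

    # The posterior marginal is proportional to forward times backward.
    return [_normalize([f * b for f, b in zip(fwd[i], bwd[i])]) for i in range(n)]


# Example: a chain that tends to repeat the previous letter.
sticky = [[0.7 if a == b else 0.1 for b in range(4)] for a in range(4)]
marginals = posterior_marginals("ANR", [0.25] * 4, sticky)
for code, dist in zip("ANR", marginals):
    print(code, [round(p, 3) for p in dist])
```

In this sketch a non-degenerated position keeps all its mass on the observed letter, while a degenerated position receives a distribution that depends on its neighbours rather than being discarded. The same forward and backward quantities could in principle feed an Expectation-Maximization update of `start` and `trans`, which is left out here.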
