Journal: Bioinformatics

Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.

Abstract

MOTIVATION: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme.

RESULTS: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.

AVAILABILITY: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus.
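The abstract names the information content score but does not give its formula. As a rough illustration only, the sketch below assumes one common form of such a log-likelihood score: for each alignment column, the sum over letters of f * ln(f / p), where f is the observed letter frequency in that column and p is the background frequency of that letter. The function name `information_content`, the uniform background, and the toy sequences are all hypothetical and are not taken from the authors' programs.

```python
import math
from collections import Counter


def information_content(alignment, background):
    """Assumed score: sum over columns and letters of f * ln(f / p).

    alignment  -- list of equal-length strings (ungapped aligned sites)
    background -- dict mapping each letter to its prior frequency p
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same width")

    n = len(alignment)
    total = 0.0
    for col in range(width):
        # Count how often each letter occurs in this column.
        counts = Counter(seq[col] for seq in alignment)
        for letter, count in counts.items():
            f = count / n  # observed frequency; unobserved letters add 0
            total += f * math.log(f / background[letter])
    return total


# Toy data, invented for illustration (not real binding sites).
uniform = {base: 0.25 for base in "ACGT"}
sites = ["TGTGA", "TGTGA", "TGCGA", "TGTGA"]
score = information_content(sites, uniform)
print(f"information content (nats): {score:.4f}")
```

Under this assumed definition, a fully conserved column contributes ln(1/p) and a column that matches the background contributes 0, so higher scores mark alignments that differ more from random sequence. Turning a score into a P value, and multiplying it by the number of possible alignments as the abstract describes, is not attempted here.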
