Algorithm for DNA sequence compression based on prediction of mismatch bases and repeat location

Abstract

For DNA sequence compression, methods based on Markov modeling and on repeats have been observed to give the best results. However, these methods tend to assume that the mismatches in approximate repeats are uniformly distributed. We show that these replacements are not uniformly distributed and that compression efficiency can be improved by using a non-uniform distribution for mismatches. We also propose a hash-table-based method to predict repeat location, which works well for block-based genomic sequence compression algorithms. The proposed methods give good compression gains, and they can be incorporated into any algorithm that uses approximate repeats to realize similar gains.
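The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the k-mer length, and the skewed probability values are assumptions chosen only for illustration. The first two functions use a hash table of k-mers to guess where a repeat of the current block starts in the already-seen sequence. The third compares the ideal code length of mismatched bases under a uniform model (each of the three alternative bases has probability 1/3) with that under a non-uniform model.

```python
import math
from collections import defaultdict


def build_kmer_index(seq, k):
    """Hash table mapping each k-mer to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def predict_repeat_start(index, history, block, k):
    """Guess where a repeat of `block` begins in `history` (hypothetical helper).

    Candidates are the positions of the block's first k-mer; the one whose
    window has the fewest mismatches against the block is returned.
    Returns None when the first k-mer has not been seen.
    """
    best, best_mismatches = None, None
    for pos in index.get(block[:k], []):
        window = history[pos:pos + len(block)]
        if len(window) < len(block):
            continue  # candidate runs past the end of the history
        mismatches = sum(a != b for a, b in zip(window, block))
        if best is None or mismatches < best_mismatches:
            best, best_mismatches = pos, mismatches
    return best


def mismatch_cost_bits(pairs, probs=None):
    """Ideal code length in bits for a list of (reference, actual) mismatches.

    probs maps (reference, actual) to P(actual | reference, mismatch).
    With probs=None every mismatch costs log2(3) bits (uniform assumption).
    """
    total = 0.0
    for ref, actual in pairs:
        p = 1.0 / 3.0 if probs is None else probs[(ref, actual)]
        total += -math.log2(p)
    return total


if __name__ == "__main__":
    history = "ACGTACGTTTGACCA"
    index = build_kmer_index(history, 4)
    print(predict_repeat_start(index, history, "ACGTTAGA", 4))

    # Illustrative probabilities only; they are NOT taken from the paper.
    skewed = {("A", "G"): 0.6, ("C", "T"): 0.6}
    pairs = [("A", "G"), ("C", "T"), ("A", "G")]
    print(mismatch_cost_bits(pairs), mismatch_cost_bits(pairs, skewed))
```

When the observed mismatches fall mostly on the substitutions the model considers likely, the non-uniform cost is lower than the uniform one, which is the kind of gain the abstract refers to. The actual distribution would have to be estimated from data.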

Bibliographic information

Source: 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops, 2010, pp. 851-852 (2 pages)
Original format: PDF
Chinese Library Classification: Information processing technology

