K-means clustering based compression algorithm for the high-throughput DNA sequence

Abstract

This paper proposes DNAC-K, a compression algorithm for high-throughput DNA sequences based on K-means clustering. DNAC-K first creates clusters of sequences using the K-means clustering method, then iterates on the clusters according to the edit distances of subsequences, and finally uses Huffman coding to encode the clustering result. Experimental results on several sequencing data sets show that DNAC-K performs better than many current high-throughput DNA sequence compression algorithms.
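The abstract names three ingredients: assigning sequences to cluster centres, edit distance, and Huffman coding. The sketch below illustrates those generic building blocks only. It does not reproduce DNAC-K, because this record does not give the paper's feature representation, centre-update rule, or output format. The function names (`edit_distance`, `assign_to_nearest`, `huffman_code`), the toy reads, and the fixed centres are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def assign_to_nearest(reads, centers):
    """Assignment step of a K-means-style loop: label each read with the
    index of the closest centre (here measured by edit distance)."""
    return [min(range(len(centers)),
                key=lambda k: edit_distance(read, centers[k]))
            for read in reads]


def huffman_code(symbols) -> dict:
    """Build a Huffman prefix code (symbol -> bit string) from frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(n, next(tie), {s: ""}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


if __name__ == "__main__":
    # Hypothetical toy data; real sequencing reads are far longer and noisier.
    reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTACGT"]
    centers = ["ACGTACGT", "TTTTGGGG"]
    labels = assign_to_nearest(reads, centers)
    code = huffman_code(labels)
    print(labels, code, "".join(code[label] for label in labels))
```

A full K-means loop would alternate this assignment step with a centre-update step until the labels stop changing. How the centres are updated for string data is a design choice the abstract does not specify.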


Bibliographic details

Source
International Conference on Audio, Language and Image Processing | 2014 | pp. 952-955 | 4 pages
Authors
Li Tan; Jifeng Sun
Original format: PDF
Keywords
DNA; Huffman codes; biology computing; data compression; encoding; pattern clustering; DNA sequence compression algorithms; DNAC-K; Huffman coding; K-means clustering based compression algorithm; edit distances; high-throughput DNA sequence; sequencing data sets; subsequences; Bioinformatics; Clustering algorithms; Clustering methods; Compression algorithms; Genomics; DNA sequence compression; K-means clustering; sequence alignment

