Journal of Bioinformatics and Computational Biology

Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases

Abstract

A biomedical study typically generates tens or hundreds of FASTA data sets of short reads. However, these data sets are currently compressed one by one, without considering the similarity between data sets, which could otherwise be exploited to improve de novo compression. We show that clustering these data sets into similar sub-groups for group-by-group compression can greatly improve compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) of every read in each data set, and to use these k-mers as features and their frequencies in each data set as feature values, so that each of these huge data sets is transformed into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. Because a large number of common k-mers with similar feature values in two data sets implies a high proportion of overlapping reads shared between them, merging similar data sets creates substantial sequence redundancy that boosts compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97 GB).
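The transformation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of relative (rather than raw) frequencies, the cosine similarity measure, the threshold-based single-linkage grouping used in place of an unspecified clustering algorithm, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def read_fasta(lines):
    """Yield sequences from FASTA-formatted lines (header lines start with '>')."""
    seq = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq:
                yield "".join(seq)
                seq = []
        elif line:
            seq.append(line.upper())
    if seq:
        yield "".join(seq)


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read (None if the read is shorter than k)."""
    if len(read) < k:
        return None
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def feature_vector(reads, k):
    """Map each minimizer to its relative frequency in one data set (normalization is an assumption)."""
    counts = Counter(m for m in (minimizer(r, k) for r in reads) if m is not None)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_datasets(vectors, threshold):
    """Stand-in clustering: single-linkage grouping of data sets whose similarity reaches the threshold."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy data sets (hypothetical reads, far smaller than real short-read files).
datasets = {
    "A": [">r1", "ACGTAC", ">r2", "TTACGA"],
    "B": [">r1", "ACGTAA", ">r2", "GGACGA"],
    "C": [">r1", "TTTTGG", ">r2", "GGTTTT"],
}
vectors = {name: feature_vector(read_fasta(lines), k=3) for name, lines in datasets.items()}
groups = group_datasets(vectors, threshold=0.8)
print(groups)
```

In this toy example, data sets A and B share the same minimizer for every read and end up in one group, while C stays on its own. Each group would then be merged and passed to a read compressor; that step is outside the scope of this sketch.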