Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases

Tang Tao; Li Jinyan

Abstract

FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, these data sets are currently compressed one by one, without considering the similarity between data sets, which could otherwise be exploited to enhance de novo compression performance. We show that clustering these data sets into similar sub-groups for group-by-group compression can greatly improve compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) of every read in each data set, and to use these k-mers as features and their frequencies in each data set as feature values, transforming each of these huge data sets into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. Because a large number of common k-mers with similar feature values between two data sets implies a high proportion of overlapping reads shared between them, merging similar data sets creates immense sequence redundancy that boosts compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97 GB).
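The transformation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy reads, the choice of k = 4, the use of relative frequencies as feature values, and the cosine-similarity threshold grouping (a simple stand-in for the unspecified unsupervised clustering algorithms) are all assumptions made for illustration. Reverse complements are not considered.

```python
from collections import Counter
from math import sqrt


def read_fasta(lines):
    """Yield read sequences from FASTA-formatted lines (headers start with '>')."""
    seq = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq:
                yield "".join(seq)
                seq = []
        elif line:
            seq.append(line.upper())
    if seq:
        yield "".join(seq)


def k_minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read (None if too short)."""
    if len(read) < k:
        return None
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def feature_vector(reads, k):
    """Map each k-minimizer to its relative frequency in one data set (assumed scaling)."""
    counts = Counter(m for m in (k_minimizer(r, k) for r in reads) if m is not None)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Group data sets linked by similarity >= threshold (hypothetical stand-in clustering)."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Toy data sets: A and B share their k-minimizers, C does not.
datasets = {
    "A": [">r1", "ACGTACGT", ">r2", "TTGACCA"],
    "B": [">r1", "TACGTACG", ">r2", "GACCATT", ">r3", "ACGTTTTT"],
    "C": [">r1", "GGGTTTGG", ">r2", "CCCTTTCC"],
}
vectors = {name: feature_vector(read_fasta(lines), k=4) for name, lines in datasets.items()}
print(cluster(vectors, threshold=0.8))  # → [['A', 'B'], ['C']]
```

In this sketch, the data sets in each resulting group would be concatenated and handed to a compressor as one unit, which is where the shared reads would become exploitable redundancy.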


Bibliographic details

Source
Journal of Bioinformatics and Computational Biology | 2021, Issue 1 | 15 pages
Authors
Tang Tao; Li Jinyan
Author affiliations

Univ Technol Sydney, Fac Engn & IT, Adv Analyt Inst, Broadway, NSW 2007, Australia;

Univ Technol Sydney, Fac Engn & IT, Adv Analyt Inst, Broadway, NSW 2007, Australia

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Cell biology
Keywords
Sequence reads; k-minimizer; compression; clustering

