Conference: IEEE International Conference on Bioinformatics and Biomedicine

Protein Sequence Classification Using Feature Hashing



Abstract

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation used for protein sequence classification usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable because of the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification: the original high-dimensional space is reduced by mapping features to hash keys, so that multiple features can be mapped (at random) to the same key, and then aggregating the counts of the features that share a key. We compare feature hashing with the bag of k-grams and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
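
The idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`kgrams`, `bucket`, `hashed_kgram_counts`), the choice of MD5 as the hash function, the bucket count of 16, and the example sequence are all assumptions made for illustration. The sketch extracts overlapping k-grams from a sequence, maps each one to a hash key, and sums the counts of k-grams that land on the same key.

```python
import hashlib


def kgrams(sequence, k):
    """Return all overlapping k-grams of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bucket(feature, n_buckets):
    """Map a feature string to a hash key in [0, n_buckets).

    MD5 is an illustrative choice (an assumption, not taken from the paper);
    it is used only because it gives the same key on every run, unlike
    Python's built-in hash() for strings.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_kgram_counts(sequence, k, n_buckets):
    """Build a fixed-length count vector by hashing each k-gram.

    Distinct k-grams may collide on the same key; their counts are summed.
    """
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        vector[bucket(gram, n_buckets)] += 1
    return vector


# Hypothetical example sequence, for illustration only.
example = "MKVLAAGIVGL"
vector = hashed_kgram_counts(example, k=3, n_buckets=16)

# With the 20 standard amino acids, a bag of k-grams has up to 20**k
# dimensions, while the hashed vector always has n_buckets entries.
full_dimension = 20 ** 3
```

Here the vector length is fixed at `n_buckets` regardless of k, whereas the unhashed bag of 3-grams over the 20 standard amino acids could need up to 8000 dimensions. The total of the hashed counts still equals the number of k-grams in the sequence, since collisions merge counts rather than discard them.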
