Published in: Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

Study of transductive learning and unsupervised feature construction methods for biological sequence classification



Abstract

Next Generation Sequencing (NGS) technologies have led to fast and inexpensive production of large amounts of biological sequence data, including nucleotide sequences and derived protein sequences. These fast-increasing volumes of data pose challenges to computational methods for annotation. Machine learning approaches, primarily supervised algorithms, have been widely used to assist with classification tasks in bioinformatics. However, supervised algorithms rely on large amounts of labeled data to produce quality predictors, and labeled data is often difficult and expensive to acquire in sufficiently large quantities. When only limited amounts of labeled data but considerably larger amounts of unlabeled data are available for a specific annotation problem, semi-supervised learning approaches represent a cost-effective alternative. In this work, we focus on a special case of semi-supervised learning, namely transductive learning, in which the algorithm has access during the training phase to the instances that need to be labeled. Transduction is particularly suitable for biological sequence classification, where the goal is generally to label a given set of unlabeled instances. A remaining challenge in this context is the identification of compact sets of informative features. Given the lack of labeled data, standard supervised feature selection methods may produce unreliable features. Therefore, we study recently proposed unsupervised feature construction approaches together with transductive learning. Experimental results on two classification problems, namely cassette exon identification and protein localization, show that the unsupervised features result in better performance than the supervised features.
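
The transductive setting described in the abstract can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the k-mer frequency features and the nearest-neighbor self-training loop are stand-ins chosen for brevity, and the function names (`kmer_vector`, `cosine`, `transduce`), labels, and toy sequences are all hypothetical. The point is only that features are built without labels and that the instances to be labeled are available while labels are being propagated.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Relative k-mer frequencies. No labels are used, so this feature
    construction step is unsupervised (an assumed stand-in, not the paper's)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def transduce(labeled, unlabeled, k=2):
    """Label a fixed set of unlabeled sequences by nearest-neighbor
    self-training: the most confidently matched instance is labeled first
    and then joins the pool that the remaining instances are compared to."""
    if not labeled:
        raise ValueError("at least one labeled sequence is required")
    pool = [(kmer_vector(s, k), y) for s, y in labeled]
    pending = {i: kmer_vector(s, k) for i, s in enumerate(unlabeled)}
    predicted = [None] * len(unlabeled)
    while pending:
        # Pick the (unlabeled, pool) pair with the highest similarity.
        _, best_i, best_label = max(
            (cosine(vec, pool_vec), i, label)
            for i, vec in pending.items()
            for pool_vec, label in pool
        )
        predicted[best_i] = best_label
        pool.append((pending.pop(best_i), best_label))
    return predicted


# Toy data, made up for illustration only.
labeled = [("ACACACAC", "pos"), ("GTGTGTGT", "neg")]
unlabeled = ["ACACACAA", "GTGTGTGG", "TGTGTGTT"]
print(transduce(labeled, unlabeled))
```

Because newly labeled instances join the pool, an unlabeled sequence can receive its label through another unlabeled sequence rather than directly from the small labeled set, which is the practical benefit of seeing the target instances during training.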
