Conference: International Workshop on Database and Expert Systems Applications

The Role of Machine Learning in Finding Chimeric RNAs

Abstract

High-throughput sequencing technology and bioinformatics have identified chimeric RNAs (chRNAs), raising the possibility that chRNAs expressed specifically in diseases can be used as potential biomarkers in both diagnosis and prognosis. The task of discriminating true chRNAs from false ones poses an interesting Machine Learning (ML) challenge. First, the sequencing data may contain false reads due to technical artefacts, and during the analysis process bioinformatics tools may generate false positives due to methodological biases. Thus, separating the real signal from the noise can be a hard task. Furthermore, even if we succeed in obtaining a proper set of observations (enough sequencing data) about true chRNAs, chances are that the devised model will not be able to generalize beyond it. As in any other machine learning problem, the first big issue is finding good data (observations) to build the prediction model. Unfortunately, as far as we know, there is no common benchmark data available for chRNAs, and a definition of a classification baseline is lacking in the related literature. In this work we move towards benchmark data and a fair comparison analysis unraveling the role of ML techniques in finding chRNAs. We have developed a benchmark pipeline incorporating a mutated-genome process and RNA-seq data simulated by Flux Simulator. These sequencing reads were aligned and annotated by CRAC. CRAC offers a new way to analyze RNA-seq data by integrating genomic location and local coverage, allowing biological predictions in one step. The resulting data were used as a benchmark for our comparison analysis. We have observed that the no free lunch theorem does not hold for ensemble classifiers. Ensemble learning strategies proved more robust on this classification problem, providing an average AUC of 95% (ACC=94%, Kappa=0.87).
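The abstract reports three evaluation metrics (AUC, accuracy, and Cohen's kappa) without defining them. As a minimal sketch of how these are typically computed for a binary true/false chRNA classifier, the following uses only the Python standard library. The labels and scores are hypothetical toy values, not data from the paper, and the function names are illustrative rather than part of the authors' pipeline.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of candidates whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    true_list, pred_list = list(y_true), list(y_pred)
    n = len(true_list)
    p_o = accuracy(true_list, pred_list)  # observed agreement
    labels = set(true_list) | set(pred_list)
    # Expected agreement if predictions were independent of the truth.
    p_e = sum(
        (true_list.count(c) / n) * (pred_list.count(c) / n) for c in labels
    )
    return (p_o - p_e) / (1 - p_e)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random true chRNA outscores a random false one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = true chRNA, 0 = false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

print(f"ACC   = {accuracy(y_true, y_pred):.4f}")
print(f"Kappa = {cohen_kappa(y_true, y_pred):.4f}")
print(f"AUC   = {roc_auc(y_true, scores):.4f}")
```

Accuracy and AUC are proportions and are often quoted as percentages, whereas Cohen's kappa is a chance-corrected coefficient with a maximum of 1.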
