Ensemble Learning for Large Scale Virtual Screening on Apache Spark

Abstract

Virtual screening (VS) is an in silico drug-discovery technique that identifies candidate drugs by computationally screening large libraries of small molecules. Various ligand-based and structure-based virtual screening approaches have been proposed in recent decades. Machine learning (ML) techniques have been widely applied in the drug discovery and development process, predominantly in ligand-based virtual screening. Ensemble learning is a common ML paradigm in which many models are trained on data for the same problem and their results are combined into a single, improved prediction. Applying VS to massive molecular libraries (Big Data) is computationally intensive, so the data must be split into chunks so that the task can be parallelized and distributed. For many years, MapReduce has been successfully applied on clusters to solve problems involving very large datasets, but it has some limitations. Apache Spark is an open-source framework for Big Data processing that overcomes the shortcomings of MapReduce. In this paper, we propose a new approach based on the ensemble learning paradigm in Apache Spark to improve the execution time and precision of large-scale virtual screening. We generate a new training dataset to evaluate our approach. The experimental results show good predictive performance, with up to 92% precision and an acceptable execution time.
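The chunk-and-combine idea described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' method: the abstract does not name the base learners, the combination rule, or the dataset, so the nearest-centroid classifier, the majority vote, the synthetic four-feature data, and every function name below are assumptions chosen only for illustration. On a cluster, the per-chunk training step would presumably run on separate data partitions rather than in a local loop.

```python
import random
from collections import Counter


def make_dataset(n, seed=0):
    """Synthetic stand-in for molecular descriptors (an assumption, not real data).

    Label 1 ("active") samples are drawn around 1.0, label 0 around 0.0.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        features = [rng.gauss(label, 0.6) for _ in range(4)]
        data.append((features, label))
    return data


def split_into_chunks(data, k):
    """Split the data into k interleaved chunks of near-equal size."""
    return [data[i::k] for i in range(k)]


def train_centroid_model(chunk):
    """Train a nearest-centroid classifier: one mean feature vector per class."""
    sums = {}
    counts = Counter()
    for features, label in chunk:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def ensemble_predict(models, features):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict(m, features) for m in models)
    return votes.most_common(1)[0][0]


def precision(y_true, y_pred, positive=1):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Train one model per chunk, then score the combined prediction.
train = make_dataset(1000, seed=1)
test = make_dataset(200, seed=2)
models = [train_centroid_model(chunk) for chunk in split_into_chunks(train, 5)]
y_true = [label for _, label in test]
y_pred = [ensemble_predict(models, features) for features, _ in test]
print("ensemble precision:", precision(y_true, y_pred))
```

An odd number of chunks is used so that a two-class majority vote cannot tie. The precision printed here reflects only the synthetic data and says nothing about the 92% figure reported in the abstract.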


Bibliographic details

Source: Computational intelligence and its applications, 2018, pp. 244-256 (13 pages)
Conference location: Oran (DZ)
Authors: Karima Sid; Mohamed Batouche
Affiliation (both authors): Computer Science Department, University of Constantine 2 - Abdelhamid Mehri, Constantine, Algeria
Original format: PDF
Language: English
Keywords: Virtual screening; Big data; Apache Spark; Machine learning; Ensemble learning

