BMC Bioinformatics

Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications


Abstract

Background: The huge quantity of data produced in biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears to be a magic bullet for this challenge. However, several hard-to-solve parallelization and load-balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general-purpose learning algorithm, originally conceived for DNA analysis and recently extended to handle uncertainty in the data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed iteratively by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; these conditions allow the computation of a set of coefficients (relevances) for each attribute (literal) that form a probability distribution, guiding the selection of the term literals. Its great versatility makes U-BRAIN applicable in many fields in which there are data to be analyzed. However, the memory and execution time required are of order O(n³) and O(n⁵), respectively, so the algorithm is unaffordable for huge data sets.

Results: We find mathematical and programming solutions that lead towards an implementation of the U-BRAIN algorithm on parallel computers. First we give a dynamic programming model of the U-BRAIN algorithm; then we minimize the representation of the relevances. When the data are large we are forced to use mass memory, and depending on where the data are actually stored, the access times can differ considerably. Following an evaluation of algorithmic efficiency based on the Disk Model, and in order to reduce the cost of communication between the different memories (RAM, cache, mass, virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm, modelled as an SPMD system with a message-passing programming paradigm. Here we adopt the high-level message-passing system MPI (Message Passing Interface) in its version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration, and mapping. The decomposition of the U-BRAIN algorithm requires the design of a communication protocol among the processors involved. Efficient synchronization design is also discussed.
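To make the description above concrete, the following is a minimal sketch, in Java since the implementation targets MPJ, of how one iteration of relevance-guided literal selection could be organized as an SPMD computation with a collective reduction. It assumes the mpiJava-style API of MPJ Express (MPI.Init, Rank, Size, Allreduce); the class name UBrainTermSketch, the data sizes, the placeholder scoring rule, and the simple block partitioning are illustrative assumptions, not the paper's actual decomposition, relevance formula, or communication protocol.

```java
import mpi.MPI;

// Illustrative sketch (not the published U-BRAIN code): each rank computes
// partial "relevance" scores for the literals from its own block of instances,
// the partial scores are summed across ranks with Allreduce, and the literal
// with the highest global relevance is greedily added to the current
// conjunctive term. The data, the scoring rule, and the partitioning are
// placeholders; real U-BRAIN derives relevances from condition sets over
// positive/negative instances and handles missing bits.
public class UBrainTermSketch {

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int nInstances = 8_000;   // hypothetical number of training instances
        int nLiterals  = 32;      // hypothetical number of candidate literals

        // Partitioning: block distribution of instance indices across ranks.
        int chunk = (nInstances + size - 1) / size;
        int lo = rank * chunk;
        int hi = Math.min(nInstances, lo + chunk);

        // Local computation: partial relevance of each literal on this block.
        double[] local  = new double[nLiterals];
        double[] global = new double[nLiterals];
        for (int i = lo; i < hi; i++) {
            for (int j = 0; j < nLiterals; j++) {
                local[j] += partialRelevance(i, j);   // placeholder arithmetic
            }
        }

        // Communication/agglomeration: every rank obtains the same global
        // relevance vector by summing the partial vectors.
        MPI.COMM_WORLD.Allreduce(local, 0, global, 0, nLiterals, MPI.DOUBLE, MPI.SUM);

        // Normalize to a probability distribution and greedily pick the most
        // relevant literal for the current conjunctive term.
        double sum = 0.0;
        for (double g : global) sum += g;
        int best = 0;
        for (int j = 1; j < nLiterals; j++) {
            if (global[j] > global[best]) best = j;
        }
        if (rank == 0) {
            System.out.printf("selected literal %d with relevance %.4f%n",
                              best, global[best] / sum);
        }
        MPI.Finalize();
    }

    // Placeholder standing in for the per-instance relevance contribution.
    private static double partialRelevance(int instance, int literal) {
        return ((instance * 31 + literal) % 11) / 10.0;
    }
}
```

Such a program would typically be launched on p processes with the MPJ Express runtime (e.g. mpjrun.sh -np p UBrainTermSketch); the actual implementation additionally manages the mass-storage layout and the synchronization points described above.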
Conclusions: In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the Intel Xeon E7xxx and E5xxx family of the CRESCO infrastructure of the Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice-junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset), and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). On IPDATA, the execution time and speed-up reach their best values at about 90 processors; beyond that, the parallelization advantage is offset by the greater cost of non-local communication between processors. A similar behaviour is evident on HS3D, but at a larger number of processors, evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one.
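The scaling behaviour reported above, speed-up improving up to roughly 90 processors on IPDATA and then degrading as non-local communication dominates, is what a simple cost model predicts when per-processor work shrinks as 1/p while communication overhead grows with p. The toy below is an assumption for illustration only, with invented constants chosen to place the optimum near 90 processors; the actual speed-ups are those reported in the abstract.

```java
// Toy scaling model with invented constants (not measurements from the paper):
// T(p) = W / p + c * p, so speed-up S(p) = T(1) / T(p) peaks where the two
// terms balance and then falls as communication dominates.
public class SpeedupToyModel {
    public static void main(String[] args) {
        double work = 8_100.0;  // hypothetical serial compute time (arbitrary units)
        double comm = 1.0;      // hypothetical per-processor communication cost
        double serial = work + comm;     // T(1)
        int bestP = 1;
        double bestSpeedup = 1.0;
        for (int p = 1; p <= 256; p++) {
            double t = work / p + comm * p;
            double speedup = serial / t;
            if (speedup > bestSpeedup) { bestSpeedup = speedup; bestP = p; }
        }
        // With these constants the optimum lands near p = sqrt(work/comm) = 90,
        // mirroring the qualitative behaviour described for IPDATA.
        System.out.printf("best speed-up %.1f at p = %d%n", bestSpeedup, bestP);
    }
}
```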