
Biological sequence analysis using Hadoop/MapReduce as a distributed computing model.


Abstract

Most biological (DNA, RNA, or protein) sequence analysis algorithms are complex and require extensive execution time and memory. Serial biological sequence processing algorithms do not use the computing power of present-day computers efficiently. Researchers and scientists have developed and tested many programming models for parallelizing and optimizing algorithms to reduce execution time and memory use.

MapReduce is a programming model based on functional programming in which users implement two functions, map and reduce. In general, map applies a function to each input record, and reduce aggregates the results of those applications. The MapReduce programming model is patented by Google. This research used the Hadoop implementation of MapReduce. Hadoop and the Hadoop Distributed File System are open-source counterparts of MapReduce and the Google File System. The Hadoop framework automatically transforms map and reduce applications into map and reduce tasks.

All known biological sequences and their functional annotations are stored in biological databases. A newly determined biological sequence should be compared with every known corresponding biological sequence to detect potential structural or evolutionary relationships. From a computational point of view, a major challenge is to align the query sequence against a very large collection of biological sequences and to sort those sequences by their alignment score with the query. The solution has to be fast and scalable.

The main goals of this thesis research are:
• To build a fully distributed Ubuntu Hadoop cluster of four nodes.
• To configure and test the Hadoop cluster on the LittleFe cluster computer.
• To determine and measure the efficiency of the programs in terms of time and memory used.

The main achievements and results of this thesis research are:
• Conversion of the LittleFe cluster computer from the BCCD operating system to the Ubuntu operating system.
• Two Hadoop examples, RandomTextWriter.java and SecondarySort.java, were modified into, respectively, the Hadoop program MRGenerateDNA.java, which generates a large file of random DNA sequences, and the Hadoop program MRSortDNA.java, which sorts DNA sequences.
• Demonstrated that Hadoop is an efficient programming model for developing new parallel algorithms for biological sequence processing based on the MapReduce programming model.
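The map and reduce roles described in the abstract, and the generate-then-sort workflow of the two thesis programs, can be illustrated with a small in-memory sketch. This is not the thesis code and does not use the Hadoop API: it is plain Python, and the function names (`generate_dna`, `map_phase`, `shuffle_phase`, `reduce_phase`, `sort_dna`) are hypothetical names chosen for illustration. It assumes the sort relies on the framework ordering keys between the map and reduce phases, which is how sorting is commonly expressed in MapReduce.

```python
import random
from itertools import groupby
from operator import itemgetter


def generate_dna(count, length, seed=0):
    """Return `count` random DNA strings of `length` bases (seeded for repeatability)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(count)]


def map_phase(records):
    """Map: emit one (sequence, 1) pair per input record."""
    for seq in records:
        yield seq, 1


def shuffle_phase(pairs):
    """Shuffle: sort pairs by key and group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key (here, count occurrences)."""
    for key, values in grouped:
        yield key, sum(values)


def sort_dna(records):
    """Run map, shuffle, and reduce in sequence; output is ordered by sequence."""
    return list(reduce_phase(shuffle_phase(map_phase(records))))


if __name__ == "__main__":
    # Hypothetical small input; a real job would read a large file from HDFS.
    for seq, occurrences in sort_dna(generate_dna(count=8, length=6)):
        print(seq, occurrences)
```

In a real Hadoop job the shuffle step is performed by the framework across nodes; here it is a local sort, so the sketch shows only the data flow, not the distribution or the performance characteristics measured in the thesis.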

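Bibliographic record

  • Author

    Paudel, Roshan.

  • Author affiliation

    Morgan State University.

  • Degree-granting institution: Morgan State University
  • Subject: Biology, Bioinformatics; Computer Science
  • Degree: M.S.
  • Year: 2012
  • Pagination: 87 p.
  • Total pages: 87
  • Original format: PDF
  • Language of text: English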
