Discovery of Unknown Bacteria: A Metagenomic Analysis

Abstract

The North Railroad Avenue Plume (NRAP) Superfund Site in Española, New Mexico, is a drinking water aquifer contaminated with tetrachloroethene (also known as perchloroethylene, or PCE). PCE is a possible carcinogen, so it is important to purify this aquifer. Fortunately, microbial communities evolve to use these contaminants for energy. It is essential to understand the metabolic pathways and genetic information of the microorganisms that help in purification. Since environmental microbes cannot be cultured in laboratories, tests are conducted at the site. To aid microbial growth, Emulsified Vegetable Oil (EVO) and other amendments are added to the aquifer. PCE and its breakdown products, trichloroethylene (TCE), cis- and trans-1,2-dichloroethene (DCE), vinyl chloride, and ethane, are monitored periodically, along with water quality parameters such as dissolved oxygen, temperature, pH, and redox potential. DNA of all microbes found in this pond is also extracted periodically. It is important to sequence this DNA and to study the genes responsible for bioremediation.

Solexa 1G is a Massively Parallel Sequencing by Synthesis (MPSS) instrument that sequences fragments of ~200 base pairs and produces paired-end reads of about 36-50 bp. The challenge is to align these reads, based on their overlaps, to recover the entire sequence of each individual microbe. There are about 10^9 reads originating from different microbes. A few of these microbes have already been sequenced, and their reads are easy to remove by running BLAST against the database. A large number of reads still remain to be aligned. Paired-end information is used to perform this alignment with the best available string matching algorithms. Even so, these comparisons would take a very long time, so supercomputers are used to make the computation feasible and faster.
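To illustrate what aligning reads "based on overlapping reads" means, here is a minimal sketch that finds the longest suffix of one read matching a prefix of another and merges the two. The function names `overlap_length` and `merge_reads`, the `min_overlap` parameter, and the exact-match criterion are assumptions made for this illustration; the abstract does not say which string matching algorithm the authors use, and a real assembler would also tolerate sequencing errors.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if it is shorter than `min_overlap`.

    Hypothetical helper: exact matching only, no mismatches allowed.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    # The suffix "TGCA" of the first read matches the prefix of the second.
    print(merge_reads("ACGTTGCA", "TGCAGGT"))
```

Comparing every read against every other read this way is quadratic in the number of reads, which is why the comparisons become so expensive at the read counts described above.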
The goal is to find exactly the read that follows the current read in the sequence, greatly reducing mismatches caused by repeats. To minimize comparisons, we apply machine learning techniques. Since the genomes known today represent far less than 1% of all microbial genomes, the available training data may be insufficient for supervised learning methods such as a multi-class support vector machine (SVM). Because the number of different microbes in the sample is unknown, and a self-organizing map (SOM) needs its architecture defined a priori, we use a growing hierarchical self-organizing map (GHSOM). Because classification of microbial communities can be improved by extracting the features of transition metrics of a Markov process instead of word frequencies, we combine the transition features of Markov processes with GHSOM. Once clusters are formed, reads are compared only within their own group. This algorithm speeds up alignment and assembly of our metagenomic data. Once the sequences are aligned, feature mining techniques are used to find the microbes responsible for biodegradation. A gene-centric approach will be more revealing for working out the metabolic pathway involved in biodegradation. Here the features are genes whose proportion changes in response to the addition of EVO. A simulation program, 'Shedder', was developed to imitate the shotgun sequencing approach and to obtain important information about preprocessing and essential requirements. This program can be used to find the minimum number of reads required to obtain complete sequences. Alignment software will be benchmarked with simulated Solexa paired-end read data produced by Shedder.
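As a rough illustration of Markov transition features for a read, the sketch below estimates a first-order transition matrix over the four nucleotides and flattens it into a 16-element vector that a clustering method such as GHSOM could take as input. The function name `transition_features`, the choice of a first-order chain, and the row-normalized encoding are assumptions for this example; the abstract does not specify the order of the Markov process or how the features are encoded.

```python
from collections import Counter
from typing import List

ALPHABET = "ACGT"


def transition_features(read: str) -> List[float]:
    """Return the 16 first-order transition probabilities P(next | prev)
    estimated from one read, ordered row by row as AA, AC, AG, AT, CA, ...

    Hypothetical feature extractor: pairs containing symbols outside
    ACGT (for example N) are ignored, and rows with no observations are
    left as zeros.
    """
    # Count each pair of adjacent bases in the read.
    counts = Counter(zip(read, read[1:]))
    features = []
    for prev in ALPHABET:
        row_total = sum(counts[(prev, nxt)] for nxt in ALPHABET)
        for nxt in ALPHABET:
            if row_total:
                features.append(counts[(prev, nxt)] / row_total)
            else:
                features.append(0.0)
    return features


if __name__ == "__main__":
    vector = transition_features("ACGTACGA")
    # Print the matrix one row per preceding base.
    for i, prev in enumerate(ALPHABET):
        print(prev, vector[4 * i:4 * i + 4])
```

Unlike a plain word-frequency vector, each row here is normalized by how often the preceding base occurs, so the features describe conditional composition rather than raw counts.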