
Information recovery in the biological sciences : protein structure determination by constraint satisfaction, simulation and automated image processing



Abstract

Regardless of the field of study or particular problem, any experimental science always poses the same question: "What object or phenomenon generated the data that we see, given what is known?"

In the field of 2D electron crystallography, data are collected from a series of two-dimensional images, formed either by diffraction-mode imaging or by TEM-mode real imaging. The resulting dataset is acquired strictly in the Fourier domain, as either coupled amplitudes and phases (in TEM mode) or amplitudes alone (in diffraction mode). In either case, data are received from the microscope as a series of CCD images or scanned negatives, which generally require a significant amount of pre-processing to be useful.

Traditionally, processing the large volume of data collected from the microscope was the time-limiting factor in protein structure determination by electron microscopy. Data must first be collected from the microscope either on film negatives, which in turn must be developed and scanned, or from CCDs typically no larger than 2096x2096 pixels (though larger models are in operation). In either case, the data are finally ready for processing as 8-bit, 16-bit or (in principle) 32-bit grey-scale images.

Regardless of data source, the foundation of all crystallographic methods is the presence of a regular Fourier lattice. Two-dimensional cryo-electron microscopy of proteins introduces special challenges, as multiple crystals may be present in the same image, in some cases producing several independent lattices. Additionally, scanned negatives typically carry a rectangular region marking the film number and other details of image acquisition, which must be removed prior to processing.

If the edges of the images are not down-tapered, vertical and horizontal "streaks" will be present in the Fourier transform of the image, arising from the high-resolution discontinuities between the opposite edges of the image.
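As an illustration of this down-tapering step, here is a minimal sketch using the standard raised-cosine approach (a generic textbook construction, not the tapering code actually used in 2dx; the function name and `margin` parameter are invented for the example):

```python
import numpy as np

def taper_edges(img, margin=32):
    """Down-taper the image borders with a raised-cosine ramp so that
    opposite edges meet smoothly at the shared image mean, suppressing
    the vertical/horizontal streaks that edge discontinuities otherwise
    produce in the Fourier transform."""
    img = img.astype(float)
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(margin) / margin))  # 0 -> ~1
    win_y = np.ones(img.shape[0])
    win_y[:margin], win_y[-margin:] = ramp, ramp[::-1]
    win_x = np.ones(img.shape[1])
    win_x[:margin], win_x[-margin:] = ramp, ramp[::-1]
    mean = img.mean()
    # Taper the deviation from the mean, not the raw values, so the
    # borders fade to the mean grey level rather than to zero.
    return (img - mean) * np.outer(win_y, win_x) + mean
```

Comparing `np.fft.fft2(img)` before and after such tapering on a test image shows the axis-aligned streaks largely suppressed, while the interior of the image (and hence the crystal lattice) is untouched.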
These streaks can overlap with lattice points that fall close to the vertical and horizontal axes, disrupting both the information they contain and the ability to detect them. Lastly, SpotScanning (Downing, 1991) is a commonly used process whereby circular discs are individually scanned within an image. The large-scale regularity of the scanning pattern produces a low-frequency lattice which can interfere and overlap with any protein crystal lattices.

We introduce a series of methods, packaged into 2dx (Gipson, et al., 2007), which simultaneously address these problems, automatically detecting accurate crystal lattice parameters for a majority of images. Further, a template is described for the automation of all subsequent image processing steps on the road to a fully processed dataset.

The broader picture of image processing is one of reproducibility. The lattice parameters, for instance, are only one of hundreds of parameters which must be determined or provided, and subsequently stored and accessed in a regular way during image processing. Numerous steps, from correct CTF and tilt-geometry determination to the final stages of symmetrization and optimal image recovery, must be performed sequentially and repeatedly for hundreds of images.

The goal in such a project is then to automatically process as significant a portion of the data as possible and to reduce unnecessary, repetitive data entry by the user. Here also, 2dx (Gipson, et al., 2007), the image processing package designed to automatically process individual 2D TEM images, is introduced. This package focuses on reliability, ease of use and automation to produce the finished results necessary for full three-dimensional reconstruction of the protein in question.

Once individual 2D images have been processed, they contribute to a larger, project-wide three-dimensional dataset. Several challenges exist in processing this dataset, beyond simply the organization of results and project-wide parameters.
In particular, though tilt geometry, relative amplitude scaling and absolute orientation are in principle known (or obtainable from an individual image), errors, uncertainties and heterogeneous data types yield a 3D dataset with many parameters to be optimized. 2dx_merge (Gipson, et al., 2007) is the follow-up to the first release of 2dx, which originally processed only individual images. Based on the guiding principles of the earlier release, 2dx_merge focuses on ease of use and automation. The result is a fully qualified 3D structure determination package capable of turning hundreds of electron micrograph images, nearly completely automatically, into a full 3D structure.

Most of the processing performed in the 2dx package is based on the excellent suite of programs collectively termed the MRC package (Crowther, et al., 1996). Extensions to this suite and alternative algorithms continue to play an essential role in image processing as computers become faster and as advancements are made in the mathematics of signal processing. In this capacity, an alternative procedure to generate a 3D structure from processed 2D images is presented. This algorithm, entitled "Projective Constraint Optimization" (PCO), leverages prior known information, such as symmetry and the fact that the protein is bound in a membrane, to extend the normal boundaries of resolution. In particular, traditional methods (Agard, 1983) make no attempt to account for the "missing cone", a vast, un-sampled region in 3D Fourier space arising from specimen tilt limitations in the microscope. Provided sufficient data, PCO simultaneously refines the dataset, accounting for error, while attempting to fill this missing cone.

Though PCO provides a near-optimal 3D reconstruction based on the data, depending on initial data quality and the amount of prior knowledge there may be a host of solutions, and more importantly pseudo-solutions, which are more-or-less consistent with the provided dataset.
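Filling an unmeasured Fourier region from real-space constraints has a classical antecedent in Gerchberg/Papoulis-style alternating projections. The sketch below illustrates that underlying idea in one dimension (it is not the thesis's PCO algorithm, which additionally refines the measured data; function and mask names are invented for the example): alternately re-impose the measured Fourier coefficients and a real-space support constraint.

```python
import numpy as np

def fill_missing_region(measured_ft, known_mask, support_mask, n_iter=300):
    """Gerchberg/Papoulis-style alternating projections: keep the measured
    Fourier coefficients where they exist (known_mask) and enforce a
    real-space support constraint (support_mask); the unmeasured Fourier
    region is gradually filled in by the constraints."""
    estimate = np.where(known_mask, measured_ft, 0.0)
    for _ in range(n_iter):
        signal = np.fft.ifft(estimate)
        signal = np.where(support_mask, signal.real, 0.0)    # real, compactly supported
        estimate = np.fft.fft(signal)
        estimate = np.where(known_mask, measured_ft, estimate)  # re-impose data
    return estimate
```

Each step is an orthogonal projection onto a convex set that contains the true solution, so the reconstruction error is non-increasing from iteration to iteration; the 3D missing-cone case replaces the 1D masks with a conical region in Fourier space and a membrane-slab support in real space.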
Trying to find a global best fit for known information and data can be a daunting challenge mathematically; to this end, the use of meta-heuristics is addressed. Specifically, in the case of many pseudo-solutions, so long as a suitably defined error metric can be found, quasi-evolutionary swarm algorithms can be used that search solution space, sharing data as they go. Given sufficient computational power, such algorithms can dramatically reduce the search time for the global optimum for a given dataset.

Once the structure of a protein has been determined, many questions often remain about its function. Questions about the dynamics of a protein, for instance, are often not readily interpretable from structure alone. To this end, an investigation into computationally optimized structural dynamics is described. Here, in order to find the most likely path a protein might take through "conformation space" between two conformations, a graphics processing unit (GPU) optimized program and set of libraries is written to speed up the calculation of this process 30x. The tools and methods developed here serve as a conceptual template for how GPU coding was applied to other aspects of the work presented here, as well as to GPU programming generally.

The final portion of the thesis takes an apparent step in reverse, presenting a dramatic, yet highly predictive, simplification of a complex biological process. Kinetic Monte Carlo simulations idealize thousands of proteins as agents interacting by a set of simple rules (e.g. react/dissociate), offering highly accurate insights into the large-scale cooperative behavior of proteins. This work demonstrates that, for many applications, structure, dynamics or even general knowledge of a protein may not be necessary for a meaningful biological story to emerge.
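The react/dissociate idealization can be made concrete with a minimal Gillespie-style kinetic Monte Carlo sketch for a single binding equilibrium A + B <-> C (a generic textbook scheme, not the thesis's simulation code; rate constants and function name are invented for the example):

```python
import random

def kmc_react_dissociate(n_a, n_b, n_c, k_on, k_off, t_end, seed=0):
    """Gillespie-style kinetic Monte Carlo for the idealized scheme
    A + B <-> C: each protein is reduced to a count, yet the stochastic
    trajectory still captures large-scale binding behaviour."""
    rng = random.Random(seed)
    t = 0.0
    while t < t_end:
        r_on = k_on * n_a * n_b        # propensity of association
        r_off = k_off * n_c            # propensity of dissociation
        total = r_on + r_off
        if total == 0.0:
            break                      # no event can fire
        t += rng.expovariate(total)    # exponentially distributed waiting time
        if rng.random() * total < r_on:
            n_a, n_b, n_c = n_a - 1, n_b - 1, n_c + 1   # A + B -> C
        else:
            n_a, n_b, n_c = n_a + 1, n_b + 1, n_c - 1   # C -> A + B
    return n_a, n_b, n_c
```

Note that the molecule counts A + C and B + C are conserved by construction; richer rule sets (diffusion, cooperativity, multiple species) extend the same event loop.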
Additionally, even in cases where structure and function are known, such simulations can help to answer the biological question in its entirety, from structure, to dynamics, to ultimate function.
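The quasi-evolutionary swarm search mentioned earlier can likewise be sketched as a minimal particle swarm minimizer (a generic illustration, not the thesis's implementation; the objective `f` stands in for whatever dataset-consistency error metric is chosen):

```python
import random

def particle_swarm(f, bounds, n_particles=20, n_iter=100, seed=0):
    """Minimal particle swarm minimizer: each particle remembers its own
    best position and is also pulled toward the swarm-wide best, so the
    particles effectively share data as they search solution space."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal bests
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best
    w, c1, c2 = 0.7, 1.5, 1.5                   # inertia, cognitive, social weights
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Because evaluations of `f` dominate the cost, such swarms parallelize naturally across the "sufficient computational power" the abstract assumes.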

Bibliographic details

  • Author: Gipson, Bryant
  • Year: 2010
  • Format: PDF
  • Language: English
