Journal of Computational Biology

Journal information

  • Frequency: Monthly, 2009-
388 results
  • Initial cluster analysis
    Abstract: We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1, these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.
  • Symmetrical parameterization of rigid body transformations for biomolecular structures
    Abstract: Assessing preferred relative rigid body position and orientation is important in the description of biomolecular structures (such as proteins) and their interactions. In this article, we extend and apply the “symmetrical parameterization,” which we recently introduced in the kinematics community, to address problems in structural biology. We also review parameterization methods that are widely used in structural biology to describe relative rigid body motions (in particular, orientations) as a basis for comparison. The new symmetrical parameterization is useful in describing the relative biomolecular rigid body motions, where the parameters are symmetrical in the sense that the subunits of a complex biomolecular structure are described in the same way for the corresponding motion and its inverse. The properties of this new parameterization, singularity analysis, and inverse kinematics are also investigated in more detail. Finally, the parameterization is applied to real biomolecular structures, and a potential application to structure modeling of symmetric macromolecules is presented, to show the efficacy of the symmetrical parameterization in the field of computational structural biology.
  • Maintaining and enhancing diversity of sampled protein conformations in robotics-inspired methods
    Abstract: The ability to efficiently sample structurally diverse protein conformations allows one to gain a high-level view of a protein's energy landscape. Algorithms from robot motion planning have been used for conformational sampling, and several of these algorithms promote diversity by keeping track of “coverage” in conformational space based on the local sampling density. However, large proteins present special challenges. In particular, larger systems require running many concurrent instances of these algorithms, but these algorithms can quickly become memory intensive because they typically keep previously sampled conformations in memory to maintain coverage estimates. In addition, robotics-inspired algorithms depend on defining useful perturbation strategies for exploring the conformational space, which is a difficult task for large proteins because such systems are typically more constrained and exhibit complex motions. In this article, we introduce two methodologies for maintaining and enhancing diversity in robotics-inspired conformational sampling. The first method addresses algorithms based on coverage estimates and leverages the use of a low-dimensional projection to define a global coverage grid that maintains coverage across concurrent runs of sampling. The second method is an automatic definition of a perturbation strategy through readily available flexibility information derived from B-factors, secondary structure, and rigidity analysis. Our results show a significant increase in the diversity of the conformations sampled for proteins consisting of up to 500 residues when applied to a specific robotics-inspired algorithm for conformational sampling. The methodologies presented in this article may be vital components for the scalability of robotics-inspired approaches.
  • Two-exponential models of gene expression patterns for noisy experimental data
    Abstract: Spatial pattern formation of the primary anterior–posterior morphogenetic gradient of the transcription factor Bicoid (Bcd) has been studied experimentally and computationally for many years. Bcd specifies positional information for the downstream segmentation genes, affecting the fly body plan. More recently, a number of researchers have focused on the patterning dynamics of the underlying bcd messenger RNA (mRNA) gradient, which is translated into Bcd protein. New, more accurate techniques for visualizing bcd mRNA need to be combined with quantitative signal extraction techniques to reconstruct the bcd mRNA distribution. Here, we present a robust technique for quantifying gradients with a two-exponential model. This approach (1) has natural, biologically relevant parameters and (2) is invariant to linear transformations of the data arising due to variation in experimental conditions (e.g., microscope settings, nonspecific background signal). This allows us to quantify bcd mRNA gradient variability from embryo to embryo (important for studying the robustness of developmental regulatory networks); sort out atypical gradients; and classify embryos to developmental stage by quantitative gradient parameters.
  • BBK* (Branch and Bound over K*): a provable and efficient ensemble-based protein design algorithm to optimize stability and binding affinity over large sequence spaces
    Abstract: Computational protein design (CPD) algorithms that compute binding affinity, Ka, search for sequences with an energetically favorable free energy of binding. Recent work shows that three principles improve the biological accuracy of CPD: ensemble-based design, continuous flexibility of backbone and side-chain conformations, and provable guarantees of accuracy with respect to the input. However, previous methods that use all three design principles are single-sequence (SS) algorithms, which are very costly: linear in the number of sequences and thus exponential in the number of simultaneously mutable residues. To address this computational challenge, we introduce BBK*, a new CPD algorithm whose key innovation is the multisequence (MS) bound: BBK* efficiently computes a single provable upper bound to approximate Ka for a combinatorial number of sequences, and avoids SS computation for all provably suboptimal sequences. Thus, to our knowledge, BBK* is the first provable, ensemble-based CPD algorithm to run in time sublinear in the number of sequences. Computational experiments on 204 protein design problems show that BBK* finds the tightest binding sequences while approximating Ka for up to 10^5-fold fewer sequences than the previous state-of-the-art algorithms, which require exhaustive enumeration of sequences. Furthermore, for 51 protein–ligand design problems, BBK* provably approximates Ka up to 1982-fold faster than the previous state-of-the-art iMinDEE/A*/K* algorithm. Therefore, BBK* not only accelerates protein designs that are possible with previous provable algorithms, but also efficiently performs designs that are too large for previous methods.
  • Phylogenetic copy-number factorization of multiple tumor samples
    Abstract: Cancer is an evolutionary process driven by somatic mutations. This process can be represented as a phylogenetic tree. Constructing such a phylogenetic tree from genome sequencing data is a challenging task due to the many types of mutations in cancer and the fact that nearly all cancer sequencing is of a bulk tumor, measuring a superposition of somatic mutations present in different cells. We study the problem of reconstructing tumor phylogenies from copy-number aberrations (CNAs) measured in bulk-sequencing data. We introduce the Copy-Number Tree Mixture Deconvolution (CNTMD) problem, which aims to find the phylogenetic tree with the fewest CNAs that explains the copy-number data from multiple samples of a tumor. We design an algorithm for solving the CNTMD problem and apply the algorithm to both simulated and real data. On simulated data, we find that our algorithm outperforms existing approaches that either perform deconvolution/factorization of mixed tumor samples or build phylogenetic trees assuming homogeneous tumor samples. On real data, we analyze multiple samples from a prostate cancer patient, identifying clones within these samples and a phylogenetic tree that relates these clones and their differing proportions across samples. This phylogenetic tree provides a higher resolution view of copy-number evolution of this cancer than published analyses.
  • Adaptive local realignment of protein sequences
    Abstract: While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein's entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. This builds upon a recent technique known as parameter advising, which finds global parameter settings for an aligner, to now adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment has been implemented within the Opal aligner using the Facet accuracy estimator.
  • A fast approximate algorithm for mapping long reads to large reference databases
    Abstract: Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290× faster than Burrows–Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
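The two ingredients named in the abstract above, minimizers and a Jaccard-based identity estimate, can be illustrated with a small sketch. This is not the authors' implementation. The function names are hypothetical, k-mers are ordered lexicographically rather than by a hash, and the conversion from Jaccard similarity to identity uses the Mash-style distance formula as an assumed stand-in for the paper's own estimator.

```python
import math


def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    A minimizer is the smallest k-mer in each window of w consecutive
    k-mers. Lexicographic order is used here for simplicity (assumption);
    practical tools usually order k-mers by a hash value.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Ties are broken in favour of the leftmost k-mer in the window.
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + best, window[best]))
    return picked


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def identity_from_jaccard(j, k):
    """Mash-style identity estimate from a Jaccard similarity j of k-mer sets.

    Assumed formula: distance = -ln(2j / (1 + j)) / k, identity = 1 - distance.
    """
    if j <= 0.0:
        return 0.0
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - distance)
```

In a mapper built on this idea, the minimizer k-mers of a read and of a candidate reference window would be compared with `jaccard`, and the result passed to `identity_from_jaccard` to decide whether the window meets the identity requirement.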
  • Longitudinal genotype–phenotype association study through a temporal structure auto-learning predictive model
    Abstract: With the rapid development of high-throughput genotyping and neuroimaging techniques, imaging genetics has drawn significant attention in the study of complex brain diseases such as Alzheimer's disease (AD). Research on the associations between genotype and phenotype improves the understanding of the genetic basis and biological mechanisms of brain structure and function. AD is a progressive neurodegenerative disease; therefore, the study of the relationship between single nucleotide polymorphisms (SNPs) and longitudinal variations of neuroimaging phenotype is crucial. Although some machine learning models have recently been proposed to capture longitudinal patterns in genotype–phenotype association studies, most machine-learning models base the learning on a fixed structure among longitudinal prediction tasks rather than automatically learning the interrelationships. In response to this challenge, we propose a new automated time structure learning model that automatically reveals the longitudinal genotype–phenotype interactions and exploits the learned structure to enhance the phenotypic predictions. We propose an efficient optimization algorithm for our model and provide a rigorous theoretical convergence proof. We performed experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort for longitudinal phenotype prediction, including 3123 SNPs and 2 biomarkers (Voxel-Based Morphometry and FreeSurfer). The empirical results validate that our proposed model is superior to its counterparts. In addition, the best SNPs identified by our model have been replicated in the literature, which justifies our prediction.
  • A secure alignment algorithm for mapping short reads to the human genome
    Abstract: Elastic and inexpensive computing resources such as clouds have been recognized as a useful solution for analyzing massive human genomic data (e.g., acquired by using next-generation sequencers) in biomedical research. However, outsourcing human genome computation to public or commercial clouds has been hindered by privacy concerns: even a small number of human genome sequences contain sufficient information for identifying the donor of the genomic data. This issue cannot be directly addressed by existing security and cryptographic techniques (such as homomorphic encryption), because they are too heavyweight to carry out practical genome computation tasks on massive data. In this article, we present a secure algorithm to accomplish read mapping, one of the most basic tasks in human genomic data analysis, based on a hybrid cloud computing model. Compared with the existing approaches, our algorithm delegates most computation to the public cloud, while only performing encryption and decryption on the private cloud, and thus makes maximum use of the computing resources of the public cloud. Furthermore, our algorithm reports results similar to those of nonsecure read mapping algorithms, including the alignment between reads and the reference genome, which can be directly used in downstream analysis such as the inference of genomic variations. We implemented the algorithm in C++ and Python on a hybrid cloud system, in which the public cloud uses an Apache Spark system.
  • Unbiased taxonomic annotation of metagenomic samples
    Abstract: The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node under the lowest common ancestor of the candidate sequences in the reference taxonomy with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this article, we show that the Rand index is a better indicator of classification error than the often used area under the receiver operating characteristic (ROC) curve and F-measure for both balanced and imbalanced reference taxonomies. We also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time and an exact solution can be obtained by integer linear programming. Experimental results with a proof-of-concept implementation of the set cover approach to taxonomic annotation in a next release of the TANGO software show that the set cover approach further reduces ambiguity in the taxonomic annotation obtained with TANGO without distorting the relative abundance profile of the metagenomic sample.
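The reduction to set cover mentioned in the abstract above can be made concrete with the classic greedy heuristic, which yields a logarithmic approximation. The sketch below is a generic illustration, not the TANGO implementation. The function name and the mapping from taxonomy nodes to the reads they could annotate are hypothetical, and this simple version does not achieve the linear running time the abstract refers to.

```python
def greedy_set_cover(universe, candidates):
    """Greedy approximation for set cover.

    universe:   iterable of elements to cover (e.g., read identifiers).
    candidates: dict mapping a label (e.g., a taxonomy node, hypothetical)
                to the set of elements that label can cover.
    Returns the list of chosen labels in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the label covering the most still-uncovered elements;
        # iterating over sorted labels makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda label: len(candidates[label] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("some elements cannot be covered")
        chosen.append(best)
        uncovered -= gained
    return chosen
```

Each iteration selects the node that explains the largest number of reads not yet annotated, so a small set of nodes ends up covering the whole sample. An exact solution would instead require integer linear programming, as the abstract notes.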
  • Viral capsid assembly: an uncertainty quantification approach
    Abstract: Most of the existing research in assembly pathway prediction/analysis of viral capsids makes the simplifying assumption that the configuration of the intermediate states can be extracted directly from the final configuration of the entire capsid. This assumption does not take into account the conformational changes of the constituent proteins as well as minor changes to the binding interfaces that continue throughout the assembly process until stabilization. This article presents a statistical-ensemble-based approach that samples the configurational space for each monomer with the relative local orientation between monomers, to capture the uncertainties in binding and conformations. Further, instead of using larger capsomers (trimers, pentamers) as building blocks, we allow all possible subassemblies to bind in all possible combinations. We represent the resulting assembly graph in two different ways: First, we use the Wilcoxon signed-rank measure to compare the distributions of binding free energy computed on the sampled conformations to predict likely pathways. Second, we represent chemical equilibrium aspects of the transitions as a Bayesian Factor graph where both associations and dissociations are modeled based on concentrations and the binding free energies. We applied these protocols on the feline panleukopenia virus and the Nudaurelia capensis virus. Results from these experiments showed a significant departure from those that one would obtain if only the static configurations of the proteins were considered. Hence, we establish the importance of an uncertainty-aware protocol for pathway analysis, and we provide a statistical framework as an important first step toward assembly pathway prediction with high statistical confidence.
  • Quantification of twist from the central lines of β-strands
    Abstract: Since the discovery of the right-handed twist of a β-strand, many studies have been conducted to understand the twist. Given the atomic structure of a protein, twist angles have been defined using atomic positions of the backbone. However, few studies characterize twist when the atomic positions are not available but the central lines of β-strands are. Recent studies in cryoelectron microscopy show that it is possible to predict the central lines of β-strands from a medium-resolution density map. Accurate measurement of twist angles is important in identification of β-strands from such density maps. We propose an effective method to quantify twist angles from a set of splines. In a data set of 55 pairs of β-strands from 11 β-sheets of 11 proteins, the spline measurement shows results comparable to those measured using the discrete method that uses atomic positions directly, particularly in capturing twist angle change along a pair, different levels of twist among different pairs, and the average of twist angles. The proposed method provides an alternative way to characterize twist using the central lines of a β-sheet.
  • Gene expression as a dosimeter in irradiated fruit flies
    Abstract: Biological indicators would be of use in radiation dosimetry in situations where an exposed person is not wearing a dosimeter, or when physical dosimeters are insufficient to estimate the risk caused by the radiation exposure. In this work, we investigate the use of gene expression as a dosimeter. Gene expression analysis was done on 15,222 genes of Drosophila melanogaster (fruit flies) at days 2, 10, and 20 postirradiation, with X-ray exposures of 10, 1000, 5000, 10,000, and 20,000 roentgens. Several genes were identified, which could serve as a biodosimeter in an irradiated D. melanogaster model. Many of these genes have human homologues. Six genes showed a linear response (R² > 0.9) with dose at all time points. One of these genes, inverted repeat-binding protein, is a known DNA repair gene and has a human homologue (XRCC6). The lowest dose, 10 roentgens, is very low for fruit flies. If the lowest dose is excluded, 13 genes showed a linear response with dose at all time points. This includes 5 of the 6 genes that were linear with all radiation doses included. Of these 13 genes, 4 have human homologues and 8 have known functions. The expression of this panel of genes, particularly those with human homologues, could potentially be used as the biological indicator of radiation exposure in dosimetry applications.
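The linearity criterion in the abstract above (R² > 0.9 for expression against dose) can be sketched as an ordinary least-squares fit followed by a coefficient-of-determination check. The function names are hypothetical, and the abstract does not say how the authors computed R² or whether dose or expression was transformed first, so this is only one plausible reading.

```python
def linear_r_squared(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r_squared). Raises ValueError when x or y
    has no variation, because R^2 is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must both vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


def is_linear_responder(doses, expression, threshold=0.9):
    """True when the linear fit of expression on dose exceeds the R^2 threshold."""
    _, _, r2 = linear_r_squared(doses, expression)
    return r2 > threshold
```

A gene would be tested by calling `is_linear_responder` once per time point with the five doses listed in the abstract and that gene's measured expression values, keeping it only if every time point passes.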
  • Cavity and ligand shape descriptors: application to urokinase binding pockets
    Abstract: We analyzed 78 binding pockets of the human urokinase plasminogen activator (uPA) catalytic domain extracted from a data set of crystallized uPA–ligand complexes. These binding pockets were computed with an original geometric method that does not involve any arbitrary parameter, such as cutoff distances or angles. We measured the deviation from convexity of each pocket shape with the pocket convexity index (PCI). We defined a new pocket descriptor called the distributional sphericity coefficient (DISC), which indicates to what extent the protein atoms of a given pocket lie on the surface of a sphere. The DISC values were computed with the freeware PCI. The pocket descriptors and their high correspondences with ligand descriptors are crucial for polypharmacology prediction. We found that the protein heavy atoms lining the urokinase binding pockets are either located on the surface of their convex hull or lie close to this surface. We also found that the radii of the urokinase binding pockets and the radii of their ligands are highly correlated (r = 0.9).
  • Enumeration of ancestral configurations for matching gene trees and species trees
    Abstract: Given a gene tree and a species tree, ancestral configurations represent the combinatorially distinct sets of gene lineages that can reach a given node of the species tree. They have been introduced as a data structure for use in the recursive computation of the conditional probability under the multispecies coalescent model of a gene tree topology given a species tree, the cost of this computation being affected by the number of ancestral configurations of the gene tree in the species tree. For matching gene trees and species trees, we obtain enumerative results on ancestral configurations. We study ancestral configurations in balanced and unbalanced families of trees determined by a given seed tree, showing that for seed trees with more than one taxon, the number of ancestral configurations increases for both families exponentially in the number of taxa n. For fixed n, the maximal number of ancestral configurations tabulated at the species tree root node and the largest number of labeled histories possible for a labeled topology occur for trees with precisely the same unlabeled shape. For ancestral configurations at the root, the maximum increases with [expression missing in source], where [symbol missing in source] is a quadratic recurrence constant. Under a uniform distribution over the set of labeled trees of given size, the mean number of root ancestral configurations grows with [expression missing in source] and the variance with ∼[expression missing in source]. The results provide a contribution to the combinatorial study of gene trees and species trees.
  • Evaluating the impact of assemblers on virus detection in a de novo metagenomic analysis pipeline
    Abstract: Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, we recently found that striking differences in results can arise when different assemblers are used. In this study, we formally test the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.
  • Zseq: an approach for preprocessing next-generation sequencing data
    Abstract: Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. The Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.
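The two-pass procedure described in the abstract above (score each read by its unique k-mers, then drop reads whose z-score falls below a threshold) can be sketched as follows. This is a simplified illustration and not the Zseq tool itself. The function names are hypothetical, ambiguity is handled only by ignoring k-mers that contain `N`, and the GC-content adjustment mentioned in the abstract is omitted.

```python
import statistics


def unique_kmer_score(seq, k):
    """Number of distinct k-mers in seq, ignoring k-mers containing 'N'.

    Skipping ambiguous k-mers is a simplifying assumption of this sketch.
    """
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sum(1 for kmer in kmers if "N" not in kmer)


def filter_by_zscore(reads, k, threshold):
    """Keep reads whose complexity z-score is at least `threshold`.

    Pass 1 scores every read; pass 2 converts scores to z-scores using
    the population mean and standard deviation and filters the reads.
    """
    if not reads:
        return []
    scores = [unique_kmer_score(read, k) for read in reads]
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        # All reads are equally complex, so there is nothing to filter on.
        return list(reads)
    return [read for read, score in zip(reads, scores)
            if (score - mean) / sd >= threshold]
```

Both passes touch each read once, which matches the linear behaviour the abstract claims. Low-complexity reads such as homopolymer runs or stretches of `N` receive low scores and are removed first as the threshold rises.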
  • gCoda: conditional dependence network inference for compositional data
    Abstract: The increasing quality and decreasing cost of high-throughput sequencing technologies for 16S rRNA gene profiling enable researchers to directly analyze microbe communities in natural environments. The direct interactions among microbial species of a given ecological system can help us understand the principles of community assembly and maintenance under various conditions. Compositionality and dimensionality of microbiome data are two main challenges for inferring the direct interaction network of microbes. In this article, we use the logistic normal distribution to model the background mechanism of microbiome data, which can appropriately deal with the compositional nature of the data. The direct interaction relationships are then modeled via the conditional dependence network under this logistic normal assumption. We then propose a novel penalized maximum likelihood method called gCoda to estimate the sparse structure of inverse covariance for latent normal variables to address the high dimensionality of the microbiome data. An effective Majorization-Minimization algorithm is proposed to solve the optimization problem in gCoda. Simulation studies show that gCoda outperforms existing methods (e.g., SPIEC-EASI) in edge recovery of inverse covariance for compositional data under a variety of scenarios. gCoda also performs better than SPIEC-EASI for inferring direct microbial interactions of mouse skin microbiome data.
  • RareVar: a framework for detecting low-frequency single-nucleotide variants
    Abstract: Accurate identification of low-frequency somatic point mutations in tumor samples has important clinical utility. Although high-throughput sequencing technology enables capturing such variants while sequencing primary tumor samples, our ability for accurate detection is compromised when the variant frequency is close to the sequencer error rate. Most current experimental and bioinformatic strategies target mutations with ≥5% allele frequency, which limits our ability to understand cancer etiology and tumor evolution. We present an experimental and computational modeling framework, RareVar, to reliably identify low-frequency single-nucleotide variants from high-throughput sequencing data under standard experimental protocols. The RareVar protocol includes a benchmark design by pooling DNAs from already sequenced individuals at various concentrations to target variants at desired frequencies, 0.5%–3% in our case. By applying a generalized, linear model-based, position-specific error model, followed by machine-learning-based variant calibration, our approach outperforms existing methods. Our method can be applied on most capture and sequencing platforms without modifying the experimental protocol.
