Journal of Computational Biology

Journal information

  • Journal name: Journal of Computational Biology
  • Frequency: Monthly, 2009-
  • NLM title:
  • ISO abbreviation: -
  • ISSN: -
394 results
  • Ballast: A Ball-Based Algorithm for Structural Motifs
    Abstract: Structural motifs encapsulate local sequence-structure-function relationships characteristic of related proteins, enabling the prediction of functional characteristics of new proteins, providing molecular-level insights into how those functions are performed, and supporting the development of variants specifically maintaining or perturbing function in concert with other properties. Numerous computational methods have been developed to search through databases of structures for instances of specified motifs. However, it remains an open problem how best to leverage the local geometric and chemical constraints underlying structural motifs in order to develop motif-finding algorithms that are both theoretically and practically efficient. We present a simple, general, efficient approach, called Ballast (ball-based algorithm for structural motifs), to match given structural motifs to given structures. Ballast combines the best properties of previously developed methods, exploiting the composition and local geometry of a structural motif and its possible instances in order to effectively filter candidate matches. We show that on a wide range of motif-matching problems, Ballast efficiently and effectively finds good matches, and we provide theoretical insights into why it works well. By supporting generic measures of compositional and geometric similarity, Ballast provides a powerful substrate for the development of motif-matching algorithms.
  • Simultaneous Reconstruction of Multiple Signaling Pathways via the Prize-Collecting Steiner Forest Problem
    Abstract: Signaling and regulatory networks are essential for cells to control processes such as growth, differentiation, and response to stimuli. Although many “omic” data sources are available to probe signaling pathways, these data are typically sparse and noisy. Thus, it has been difficult to use these data to discover the causes of diseases and to propose new therapeutic strategies. We overcome these problems and use “omic” data to reconstruct simultaneously multiple pathways that are altered in a particular condition by solving the prize-collecting Steiner forest problem. To evaluate this approach, we use the well-characterized yeast pheromone response. We then apply the method to human glioblastoma data, searching for a forest of trees, each of which is rooted in a different cell-surface receptor. This approach discovers both overlapping and independent signaling pathways that are enriched in functionally and clinically relevant proteins, which could provide the basis for new therapeutic strategies. Although the algorithm was not provided with any information about the phosphorylation status of receptors, it identifies a small set of clinically relevant receptors among hundreds present in the interactome.
  • Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space
    Abstract: The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components. Amino acid frequencies at homologous positions within related proteins have been fruitfully modeled by Dirichlet mixtures, and we use the Dirichlet process to derive such mixtures with an unbounded number of components. This application of the method requires several technical innovations to sample an unbounded number of Dirichlet-mixture components. The resulting Dirichlet mixtures model multiple-alignment data substantially better than do previously derived ones. They consist of over 500 components, in contrast to fewer than 40 previously, and provide a novel perspective on the structure of proteins. Individual protein positions should be seen not as falling into one of several categories, but rather as arrayed near probability ridges winding through amino acid multinomial space.
  • Large Local Analysis of Unaligned Genomes and Its Application
    Abstract: We describe a novel method for the local analysis of complete genomes. A local distance measure called LODIST is proposed, which is based on the relationship between the longest common words and the shortest absent words of the two genomes being compared. LODIST can perform better than local alignment when the local region is large enough to cover some recombination genes. A distance measure called SILD.k.t with resolution k and step t is derived by the integral LODISTs of whole genomes. It is shown that the algorithm for computing the LODISTs and SILD.k.t is linear, which is fast enough for genome comparison. We verify this method by recognizing the subtypes of the HIV-1 complete genomes and genome segments.
  • Direct Updating of an RNA Base-Pairing Probability Matrix with Marginal Probability Constraints
    • Author: Michiaki Hamada
    • Journal: Journal of Computational Biology
    • Issue 12 (year not listed)
    Abstract: A base-pairing probability matrix (BPPM) stores the probabilities for every possible base pair in an RNA sequence and has been used in many algorithms in RNA informatics (e.g., RNA secondary structure prediction and motif search). In this study, we propose a novel algorithm to perform iterative updates of a given BPPM, satisfying marginal probability constraints that are (approximately) given by recently developed biochemical experiments, such as SHAPE, PAR, and FragSeq. The method is easily implemented and is applicable to common models for RNA secondary structures, such as energy-based or machine-learning–based models. In this article, we focus mainly on the details of the algorithms, although preliminary computational experiments will also be presented.
  • Identification of Complexes from Protein Interaction Networks According to Different Types of Neighborhood Density
    Abstract: To facilitate the realization of biological functions, proteins are often organized into complexes. While computational techniques are used to predict these complexes, detailed understanding of their organization remains inadequate. Apart from complexes that reside in very dense regions of a protein interaction network, which most algorithms are able to identify, we observe that many other complexes, while not residing in very dense regions, reside in regions with low neighborhood density. We develop an algorithm for identifying protein complexes by considering these two types of complexes separately. We test our algorithm on a few yeast protein interaction networks, and show that our algorithm is able to identify complexes more accurately than existing algorithms. A software program NDComplex for implementing the algorithm is available at .
  • A Clique-Based Method Using Dynamic Programming for Computing Edit Distance Between Unordered Trees
    Abstract: Many kinds of tree-structured data, such as RNA secondary structures, have become available due to the progress of techniques in the field of molecular biology. To analyze the tree-structured data, various measures for computing the similarity between them have been developed and applied. Among them, tree edit distance is one of the most widely used measures. However, the tree edit distance problem for unordered trees is NP-hard. Therefore, efficient algorithms for the problem are needed. Recently, a practical method called the clique-based algorithm has been proposed, but it is not fast for large trees. This article presents an improved clique-based method for the tree edit distance problem for unordered trees. The improved method is obtained by introducing a dynamic programming scheme and heuristic techniques to the previous clique-based method. To evaluate the efficiency of the improved method, we applied the method to comparison of real tree-structured data such as glycan structures. For large tree structures, the improved method is much faster than the previous method. In particular, for hard instances, the improved method achieved more than 100 times speed-up.
  • Polynomial-Time Algorithms for Building a Consensus MUL-Tree
    Abstract: A multi-labeled phylogenetic tree, or MUL-tree, is a generalization of a phylogenetic tree that allows each leaf label to be used many times. MUL-trees have applications in biogeography, the study of host–parasite cospeciation, gene evolution studies, and computer science. Here, we consider the problem of inferring a consensus MUL-tree that summarizes a given set of conflicting MUL-trees, and present the first polynomial-time algorithms for solving it. In particular, we give a straightforward, fast algorithm for building a strict consensus MUL-tree for any input set of MUL-trees with identical leaf label multisets, as well as a polynomial-time algorithm for building a majority rule consensus MUL-tree for the special case where every leaf label occurs at most twice. We also show that, although it is NP-hard to find a majority rule consensus MUL-tree in general, the variant that we call the singular majority rule consensus MUL-tree can be constructed efficiently whenever it exists.
  • An Expectation-Maximization Algorithm for Determining Natural Selection of Y-Linked Genes Through Two-Sex Branching Processes
    Abstract: A two-dimensional bisexual branching process has recently been presented for the analysis of the generation-to-generation evolution of the number of carriers of a Y-linked gene. In this model, preference of females for males with a specific genetic characteristic is assumed to be determined by an allele of the gene. It has been shown that the behavior of this kind of Y-linked gene is strongly related to the reproduction law of each genotype. In practice, the corresponding offspring distributions are usually unknown, and it is necessary to develop their estimation theory in order to determine the natural selection of the gene. Here we deal with the estimation problem for the offspring distribution of each genotype of a Y-linked gene when the only observable data are each generation's total numbers of males of each genotype and of females. We set out the problem in a nonparametric framework and obtain the maximum likelihood estimators of the offspring distributions using an expectation-maximization algorithm. From these estimators, we also derive the estimators for the reproduction mean of each genotype and forecast the distribution of the future population sizes. Finally, we check the accuracy of the algorithm by means of a simulation study.
  • Supervised Protein Family Classification and New Family Construction
    Abstract: The goal of protein family classification is to group proteins into families so that proteins within the same family have common function or are related by ancestry. While supervised classification algorithms are available for this purpose, most of these approaches focus on assigning unclassified proteins to known families but do not allow for progressive construction of new families from proteins that cannot be assigned. Although unsupervised clustering algorithms are also available, they do not make use of information from known families. By computing similarities between proteins based on pairwise sequence comparisons, we develop supervised classification algorithms that achieve improved accuracy over previous approaches while allowing for construction of new families. We show that our algorithm has a higher accuracy rate and a lower misclassification rate when compared to algorithms that are based on the use of multiple sequence alignments and hidden Markov models, and our algorithm performs well even on families with very few proteins and on families with low sequence similarity. A software program implementing the algorithm (SClassify) is available online ().
  • Nonadaptive Algorithms for Threshold Group Testing with Inhibitors and Error Tolerance
    Abstract: A group test gives a positive (negative) outcome if it contains at least u (at most l) positive items, and an arbitrary outcome if the number of positive items is between thresholds l and u. This problem, introduced by Damaschke, is called threshold group testing. It is a generalization of classical group testing. Chen and Fu extended this problem to the error-tolerant version and first proposed efficient nonadaptive algorithms. In this article, we extend threshold group testing to the k-inhibitors model in which a test has a positive outcome if it contains at least u positives and at most k−1 inhibitors. By using a (d + k − l, u; 2e + 1]-disjunct matrix, we provide nonadaptive algorithms for the threshold group testing model with k-inhibitors and at most e-erroneous outcomes. The decoding complexity is O(n^(u+k) log n) for fixed parameters (d, u, l, k, e).
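The outcome rule described in this abstract can be sketched in a few lines. This is only an illustrative reading of the abstract, not the authors' algorithm: the function name `threshold_test_outcome`, the parameter names `upper` (for u) and `lower` (for l), and the choice to report a negative outcome when a pool holds k or more inhibitors are all assumptions made here.

```python
from typing import Optional, Set

def threshold_test_outcome(pool: Set[int], positives: Set[int], inhibitors: Set[int],
                           upper: int, lower: int, k: int) -> Optional[bool]:
    """Return True (positive), False (negative), or None (arbitrary outcome).

    upper and lower stand for the thresholds u and l in the abstract.
    """
    n_pos = len(pool & positives)
    n_inh = len(pool & inhibitors)
    if n_inh >= k:
        # Assumption: k or more inhibitors suppress any positive signal.
        return False
    if n_pos >= upper:
        return True      # at least u positives and at most k-1 inhibitors
    if n_pos <= lower:
        return False     # at most l positives
    return None          # strictly between l and u: outcome is arbitrary

# Hypothetical instance: items 1-3 are positive, item 9 is an inhibitor.
POSITIVES, INHIBITORS = {1, 2, 3}, {9}
```

With u = 3, l = 1, and k = 1, the pool {1, 2, 3, 4} is positive, {1, 2, 5} falls in the arbitrary zone, and {1, 2, 9} is suppressed by its single inhibitor.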
  • Improvements to a Class of Distance Matrix Methods for Inferring Species Trees from Gene Trees
    Abstract: Among the methods currently available for inferring species trees from gene trees, the GLASS method of Mossel and Roch (), the Shallowest Divergence (SD) method of Maddison and Knowles (), the STEAC method of Liu et al. (), and a related method that we call Minimum Average Coalescence (MAC) are computationally efficient and provide branch length estimates. Further, GLASS and STEAC have been shown to be consistent estimators of tree topology under a multispecies coalescent model. However, divergence time estimates obtained with these methods are all systematically biased under the model because the pairwise interspecific gene divergence times on which they rely must be more ancient than the species divergence time. Jewett and Rosenberg () derived an expression for the bias of GLASS and used it to propose an improved method that they termed iGLASS. Here, we derive the biases of SD, STEAC, and MAC, and we propose improved analogues of these methods that we call iSD, iSTEAC, and iMAC. We conduct simulations to compare the performance of these methods with their original counterparts and with GLASS and iGLASS, finding that each of them decreases the bias and mean squared error of pairwise divergence time estimates. The new methods can therefore contribute to improvements in the estimation of species trees from information on gene trees.
  • VERSE: A Varying Effect Regression for Splicing Elements Discovery
    Abstract: Identification of splicing regulatory elements (SREs) deserves special attention because these cis-acting short sequences are vital parts of the splicing code. The fact that a variety of other biological signals cooperatively govern the splicing pattern indicates the necessity of developing novel tools to incorporate information from multiple sources to improve the prediction of splicing factor binding sites. In this context, we propose a Varying Effect Regression for Splicing Elements (VERSE) to discover intronic SREs in the proximity of exon junctions by integrating other biological features. As a result, 1562 intronic SREs were identified in 16 human tissues, many of which overlapped with experimentally verified binding motifs for several well-known splicing factors, including FOX-1, PTB, hnRNP A/B, hnRNP F/H, and so on. The discovered tissue, region, and conservation preferences of the putative motifs demonstrate that splice site selection is a complicated process that needs subtle and delicate regulation. VERSE may serve as a powerful tool to not only discover SREs by incorporating additional informative signals but also precisely quantify their varying contribution under different biological contexts.
  • Efficient Simulation and Likelihood Methods for Non-Neutral Multi-Allele Models
    Abstract: Throughout the 1980s, Simon Tavaré made numerous significant contributions to population genetics theory. As genetic data, in particular DNA sequence, became more readily available, a need to connect population-genetic models to data became the central issue. The seminal work of Griffiths and Tavaré (, , ) was among the first to develop a likelihood method to estimate the population-genetic parameters using full DNA sequences. Now, we are in the genomics era where methods need to scale up to handle massive data sets, and Tavaré has led the way to new approaches. However, performing statistical inference under non-neutral models has proved elusive. In tribute to Simon Tavaré, we present an article in the spirit of his work that provides a computationally tractable method for simulating and analyzing data under a class of non-neutral population-genetic models. Computational methods for approximating likelihood functions and generating samples under a class of allele-frequency based non-neutral parent-independent mutation models were proposed by Donnelly, Nordborg, and Joyce (DNJ) (Donnelly et al., ). DNJ () simulated samples of allele frequencies from non-neutral models using neutral models as auxiliary distribution in a rejection algorithm. However, patterns of allele frequencies produced by neutral models are dissimilar to patterns of allele frequencies produced by non-neutral models, making the rejection method inefficient. For example, in some cases the methods in DNJ () require 10^9 rejections before a sample from the non-neutral model is accepted. Our method simulates samples directly from the distribution of non-neutral models, making simulation methods a practical tool to study the behavior of the likelihood and to perform inference on the strength of selection.
  • A de Bruijn Graph Approach to the Quantification of Closely-Related Genomes in a Microbial Community
    Abstract: The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), thus it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, which are difficult to differentiate from the inter-species variations during read mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed. Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
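The idea of collapsing shared regions can be illustrated with a toy k-mer index. This is a minimal sketch and not the authors' implementation: each k-mer is treated as one de Bruijn graph edge stored once and labeled with the genomes that contain it. The function names, the toy sequences, and the rule of intersecting genome labels along a read are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set

def build_colored_edges(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer (one graph edge) to the set of genomes containing it.

    A k-mer shared by several genomes is stored once, so common regions collapse.
    """
    edges = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            edges[seq[i:i + k]].add(name)
    return dict(edges)

def assign_read(read: str, edges: Dict[str, Set[str]], k: int) -> Set[str]:
    """Return the genomes consistent with every k-mer of the read.

    The result is empty if the read is shorter than k or contains an unseen k-mer.
    """
    candidates = None
    for i in range(len(read) - k + 1):
        owners = edges.get(read[i:i + k], set())
        candidates = set(owners) if candidates is None else candidates & owners
    return candidates or set()

# Hypothetical toy genomes that share the prefix "ACGT".
TOY_GENOMES = {"A": "ACGTAC", "B": "ACGTTC"}
TOY_EDGES = build_colored_edges(TOY_GENOMES, k=3)
```

In this toy index a read from the shared prefix is placed once on the collapsed edges and stays ambiguous between both genomes, while a read that crosses a genome-specific edge resolves to a single genome.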
  • A Generalized Linear Model for Peak Calling in ChIP-Seq Data
    Abstract: Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) has become routine for detecting genome-wide protein-DNA interaction. The success of ChIP-Seq data analysis highly depends on the quality of peak calling (i.e., to detect peaks of tag counts at a genomic location and evaluate if the peak corresponds to a real protein–DNA interaction event). The challenges in peak calling include (1) how to combine the forward and the reverse strand tag data to improve the power of peak calling and (2) how to account for the variation of tag data observed across different genomic locations. We introduce a new peak calling method based on the generalized linear model (GLMNB) that utilizes negative binomial distribution to model the tag count data and account for the variation of background tags that may randomly bind to the DNA sequence at varying levels due to local genomic structures and sequence contents. We allow local shifting of peaks observed on the forward and the reverse strands, such that at each potential binding site, a binding profile representing the pattern of a real peak signal is fitted to best explain the observed tag data with maximum likelihood. Our method can also detect multiple peaks within a local region if there are multiple binding sites in the region.
  • Biclustering of Linear Patterns in Gene Expression Data
    Abstract: Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and the genetic algorithm in optimization, incorporating resampling for more robust discovery. When applied to both real and simulated datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. is available online (at ).
  • BLUP Genotype Imputation for Case-Control Association Testing with Related Individuals and Missing Data
    • Author: Mary Sara McPeek
    • Journal: Journal of Computational Biology
    • Issue 6 (year not listed)
    Abstract: We consider the problem of case-control association testing in samples that contain related individuals, where we assume the pedigree structure is known. Typically, for each marker tested, some individuals will have missing genotype data. The MQLS method has been proposed for association testing in this situation. We show that the MQLS method is equivalent to an approach in which missing genotypes are imputed using the best linear unbiased predictor (BLUP) based on relatives' genotype data. Viewed this way, the MQLS exactly corrects for the imputation error and for the extra correlation due to imputation. We also investigate the amount of additional power for detecting association that is provided by this BLUP imputation approach.
  • HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data
    Abstract: Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods of determining haplotype phase from sequence data (known as haplotype assembly) have difficulties producing accurate results for large (1000 genomes-type) data or operate on restricted optimizations that are unrealistic considering modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut for 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better for a variety of data and metrics. HapCompass is freely available for download ().
  • Classification of Bioinformatics Algorithms from the Viewpoint of Maximizing Expected Accuracy (MEA)
    Abstract: Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space. In general, it is difficult to design reliable estimators for this type of problem, because the number of possible solutions is immense, which leads to an extremely low probability for every solution, even for the one with the highest probability. Therefore, maximum score and maximum likelihood estimators do not work well in this situation although they are widely employed in a number of applications. Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach. In this review, we provide an extensive discussion of algorithms and software based on MEA. We describe how a number of algorithms used in previous studies can be classified from the viewpoint of MEA. We believe that this review will be useful not only for users wishing to utilize software to solve the estimation problems appearing in this article, but also for developers wishing to design algorithms on the basis of MEA.
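The contrast between picking the most probable solution and maximizing expected accuracy can be made concrete on a tiny discrete space. The sketch below is an illustration only and is not taken from the review: the probabilities in `TOY_DIST` are invented, the accuracy measure (number of agreeing positions) is one possible choice, and the brute-force search is practical only for very short strings.

```python
from itertools import product

# Toy distribution over 3-bit "solutions" (hypothetical numbers for illustration).
TOY_DIST = {"000": 0.4, "110": 0.3, "101": 0.3}

def expected_accuracy(candidate, dist):
    """Expected number of positions where `candidate` agrees with a solution drawn from `dist`."""
    return sum(p * sum(a == b for a, b in zip(candidate, sol))
               for sol, p in dist.items())

def max_probability_estimate(dist):
    """Return the single most probable solution."""
    return max(dist, key=dist.get)

def mea_estimate(dist):
    """Return the bit string with the highest expected accuracy (brute force over all strings)."""
    n = len(next(iter(dist)))
    candidates = ("".join(bits) for bits in product("01", repeat=n))
    return max(candidates, key=lambda c: expected_accuracy(c, dist))
```

For `TOY_DIST`, the most probable solution is "000" with an expected accuracy of 1.8, while the MEA estimate is "100" with an expected accuracy of 2.0, even though "100" has zero probability itself. The gap arises because the MEA estimate uses the whole distribution of solutions instead of a single mode.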
