您现在的位置:首页>美国卫生研究院文献>Journal of Computational Biology

期刊信息

  • 期刊名称:

    -

  • 刊频: Monthly, 2009-
  • NLM标题:
  • iso缩写: -
  • ISSN: -
  • 排序:
  • 显示:
  • 每页:
全选(0
<8/20>
388条结果
  • 机译 描述系统的复杂性:多变量“集合复杂性”和系统生物学的信息基础
    摘要:Context dependence is central to the description of complexity. Keying on the pairwise definition of “set complexity,” we use an information theory approach to formulate general measures of systems complexity. We examine the properties of multivariable dependency starting with the concept of interaction information. We then present a new measure for unbiased detection of multivariable dependency, “differential interaction information.” This quantity for two variables reduces to the pairwise “set complexity” previously proposed as a context-dependent measure of information in biological systems. We generalize it here to an arbitrary number of variables. Critical limiting properties of the “differential interaction information” are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of “differential interaction information” also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multivariable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system.
  • 机译 互动关系本体学习
    摘要:Ontology is widely used in semantic computing and reasoning, and various biomedicine ontologies have become institutionalized to make the heterogeneous knowledge computationally amenable. Relation words, especially verbs, play an important role when describing the interaction between biological entities in molecular function, biological process, and cellular component; however, comprehensive research and analysis are still lacking. In this article, we propose an automatic method to build interaction relation ontology by investigating relation verbs, analyzing the syntactic relation of PubMed abstracts to perform relation vocabulary expansion, and integrating WordNet into our method to construct the hierarchy of relation vocabulary. Five attributes are populated automatically for each word in interaction relation ontology. As a result, the interaction relation ontology is constructed; it contains a total of 963 words and covers the most relation words used in existing methods of proteins interaction relation.
  • 机译 使用isoform图从RNA-Seq数据建模替代剪接变体
    摘要:Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main developed tools for NGS data focus on de novo transcriptome assembly (Grabherr et al., ; Schulz et al., ). While these tools are extensively applied for biological investigations and often show intrinsic shortcomings from the obtained results, a theoretical investigation of the inherent computational limits of transcriptome analysis from NGS data, when a reference genome is unknown or highly unreliable, is still missing. On the other hand, we still lack methods for computing the gene structures due to AS events under the above assumptions—a problem that we start to tackle with this article. More precisely, based on the notion of isoform graph (Lacroix et al., ), we define a compact representation of gene structures—called splicing graph—and investigate the computational problem of building a splicing graph that is (i) compatible with NGS data and (ii) isomorphic to the isoform graph. We characterize when there is only one representative splicing graph compatible with input data, and we propose an efficient algorithmic approach to compute this graph.
  • 机译 基于基因本体的加权多径测量估计基因产物相似性
    摘要:Many different methods have been proposed for calculating the semantic similarity of term pairs based on gene ontology (GO). Most existing methods are based on information content (IC), and the methods based on IC are used more commonly than those based on the structure of GO. However, most IC-based methods not only fail to handle identical annotations but also show a strong bias toward well-annotated proteins. We propose a new method called weighted multipath measurement (WMM) for estimating the semantic similarity of gene products based on the structure of the GO. We not only considered the contribution of every path between two GO terms but also took the depth of the lowest common ancestors into account. We assigned different weights for different kinds of edges in GO graph. The similarity values calculated by WMM can be reused because they are only relative to the characteristics of GO terms. Experimental results showed that the similarity values obtained by WMM have a higher accuracy. We compared the performance of WMM with that of other methods using GO data and gene annotation datasets for yeast and humans downloaded from the GO database. We found that WMM is more suited for prediction of gene function than most existing IC-based methods and that it can distinguish proteins with identical annotations (two proteins are annotated with the same terms) from each other.
  • 机译 用de Bruijn图识别拼接调控元件
    摘要:Splicing regulatory elements (SREs) are short, degenerate sequences on pre-mRNA molecules that enhance or inhibit the splicing process via the binding of splicing factors, proteins that regulate the functioning of the spliceosome. Existing methods for identifying SREs in a genome are either experimental or computational. Here, we propose a formalism based on de Bruijn graphs that combines genomic structure, word count enrichment analysis, and experimental evidence to identify SREs found in exons. In our approach, SREs are not restricted to a fixed length (i.e., k-mers, for a fixed k). As a result, we identify 2001 putative exonic enhancers and 3080 putative exonic silencers for human genes, with lengths varying from 6 to 15 nucleotides. Many of the predicted SREs overlap with experimentally verified binding sites. Our model provides a novel method to predict variable length putative regulatory elements computationally for further experimental investigation.
  • 机译 用小波方法检测富集区域并探索表观基因组景观
    摘要:Epigenetic landscapes represent how cells regulate gene activity. To understand their effect on gene regulation, it is important to detect their occupancy in the genome. Unlike transcription factors whose binding regions are limited to narrow regions, histone modification marks are enriched over broader areas. The stochastic characteristics unique to each mark make it hard to detect their enrichment. Classically, a predefined window has been used to detect their enrichment. However, these approaches heavily rely on the predetermined parameters. Also, the window-based approaches cannot handle the enrichment of multiple marks. We propose a novel algorithm, called SeqW, to detect enrichment of multiple histone modification marks. SeqW applies a zooming approach to detect a broadly enriched domain. The zooming approach helps domain detection by increasing signal-to-noise ratio. The borders of the domains are detected by studying the characteristics of signals in the wavelet domain. We show that SeqW outperformed previous predictors in detecting broad peaks. Also, we applied SeqW in studying spatial combinations of histone modification patterns.
  • 机译 最大简约性的最坏情况复杂度
    摘要:One of the core classical problems in computational biology is that of constructing the most parsimonious phylogenetic tree interpreting an input set of sequences from the genomes of evolutionarily related organisms. We reexamine the classical maximum parsimony (MP) optimization problem for the general (asymmetric) scoring matrix case, where rooted phylogenies are implied, and analyze the worst case bounds of three approaches to MP: The approach of Cavalli-Sforza and Edwards, the approach of Hendy and Penny, and a new agglomerative, “bottom-up” approach we present in this article. We show that the second and third approaches are faster than the first one by a factor of and Θ(n), respectively, where n is the number of species.
  • 机译 具有网络结构的数据的贝叶斯盲源分离
    摘要:In biology, more and more information about the interactions in regulatory systems becomes accessible, and this often leads to prior knowledge for recent data interpretations. In this work we focus on multivariate signaling data, where the structure of the data is induced by a known regulatory network. To extract signals of interest we assume a blind source separation (BSS) model, and we capture the structure of the source signals in terms of a Bayesian network. To keep the parameter space small, we consider stationary signals, and we introduce the new algorithm emGrade, where model parameters and source signals are estimated using expectation maximization. For network data, we find an improved estimation performance compared to other BSS algorithms, and the flexible Bayesian modeling enables us to deal with repeated and missing observation values. The main advantage of our method is the statistically interpretable likelihood, and we can use model selection criteria to determine the (in general unknown) number of source signals or decide between different given networks. In simulations we demonstrate the recovery of the source signals dependent on the graph structure and the dimensionality of the data.
  • 机译 没有矩阵幂的系统发育随机映射
    摘要:Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organism carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and works well for small state spaces. The computations slow down considerably for larger state spaces (e.g., space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices—an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, which does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
  • 机译 基于连续时间马尔可夫链的代谢通量分析
    摘要:Metabolic flux analysis (MFA), a key technology in bioinformatics, is an effective way of analyzing the entire metabolic system by measuring fluxes. Many existing MFA approaches are based on differential equations, which are complicated to be solved mathematically. So MFA requires some simple approaches to investigate metabolism further. In this article, we applied continuous-time Markov chain to MFA, called MMFA approach, and transformed the MFA problem into a set of quadratic equations by analyzing the transition probability of each carbon atom in the entire metabolic system. Unlike the other methods, MMFA analyzes the metabolic model only through the transition probability. This approach is very generic and it could be applied to any metabolic system if all the reaction mechanisms in the system are known. The results of the MMFA approach were compared with several chemical reaction equilibrium constants from early experiments by taking pentose phosphate pathway as an example.
  • 机译 具有相互约束的交替RNA结构的同时折叠:在基于下一代测序的RNA结构探测中的应用
    摘要:Recent advances in next-generation sequencing technology have significantly promoted high-throughput experimental probing of RNA secondary structures. The resulting enzymatic or chemical probing information is then incorporated into a minimum free energy folding algorithm to predict more accurate RNA secondary structures. A drawback of this approach is that it does not consider the presence of alternative RNA structures. In addition, the alternative RNA structures may contaminate experimental probing information of each other and direct the minimum free-energy folding to a wrong direction. In this article, we present a combinatorial solution for this problem, where two alternative structures can be folded simultaneously given the experimental probing information regarding the mixture of these two alternative structures. We have tested our algorithm with artificially generated mixture probing data on adenine riboswitch and thiamine pyrophosphate (TPP) riboswitch. The experimental results show that our algorithm can successfully recover the ON and OFF structures of these riboswitches.
  • 机译 蛋白质序列的同时比对和折叠
    摘要:Accurate comparative analysis tools for low-homology proteins remains a difficult challenge in computational biology, especially sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at ).
  • 机译 表观遗传学中的课堂学习失衡
    摘要:In machine learning, one of the important criteria for higher classification accuracy is a balanced dataset. Datasets with a large ratio between minority and majority classes face hindrance in learning using any classifier. Datasets having a magnitude difference in number of instances between the target concept result in an imbalanced class distribution. Such datasets can range from biological data, sensor data, medical diagnostics, or any other domain where labeling any instances of the minority class can be time-consuming or costly or the data may not be easily available. The current study investigates a number of imbalanced class algorithms for solving the imbalanced class distribution present in epigenetic datasets. Epigenetic (DNA methylation) datasets inherently come with few differentially DNA methylated regions (DMR) and with a higher number of non-DMR sites. For this class imbalance problem, a number of algorithms are compared, including the TAN+AdaBoost algorithm. Experiments performed on four epigenetic datasets and several known datasets show that an imbalanced dataset can have similar accuracy as a regular learner on a balanced dataset.
  • 机译 通过网络重建阐明流感抑制途径
    摘要:Viruses evade detection by the host immune system through the suppression of antiviral pathways. These pathways are thus obscured when measuring the host response to viral infection and cannot be inferred by current network reconstruction methodology. Here we aim to close this gap by providing a novel computational framework for the inference of such inhibited pathways as well as the proteins targeted by the virus to achieve this inhibition. We demonstrate the power of our method by testing it on the response to influenza infection in humans, with and without the viral inhibitory protein NS1, revealing its direct targets and their inhibitory effects.
  • 机译 转录调控的信息要求
    摘要:Transcription factors (TFs) regulate transcription by binding to specific sites in promoter regions. Information theory provides a useful mathematical framework to analyze the binding motifs associated with TFs but imposes several assumptions that limit their applicability to specific regulatory scenarios. Explicit simulations of the co-evolution of TFs and their binding motifs allow the study of the evolution of regulatory networks with a high degree of realism. In this work we analyze the impact of differential regulatory demands on the information content of TF-binding motifs by means of evolutionary simulations. We generalize a predictive index based on information theory, and we validate its applicability to regulatory scenarios in which the TF binds significantly to the genomic background. Our results show a logarithmic dependence of the evolved information content on the occupancy of target sites and indicate that TFs may actively exploit pseudo-sites to modulate their occupancy of target sites. In regulatory networks with differentially regulated targets, we observe that information content in TF-binding motifs is dictated primarily by the fraction of total probability mass that the TF assigns to its target sites, and we provide a predictive index to estimate the amount of information associated with arbitrarily complex regulatory systems. We observe that complex regulatory patterns can exert additional demands on evolved information content, but, given a total occupancy for target sites, we do not find conclusive evidence that this effect is because of the range of required binding affinities.
  • 机译 通过探索稀疏模型空间寻找替代表达定量性状位点
    摘要:Sparse modeling, a feature selection method widely used in the machine-learning community, has been recently applied to identify associations in genetic studies including expression quantitative trait locus (eQTL) mapping. These genetic studies usually involve high dimensional data where the number of features is much larger than the number of samples. The high dimensionality of genetic data introduces a problem that there exist multiple solutions for optimizing a sparse model. In such situations, a single optimization result provides only an incomplete view of the data and lacks power to find alternative features associated with the same trait. In this article, we propose a novel method aimed to detecting alternative eQTLs where two genetic variants have alternative relationships regarding their associations with the expression of a particular gene. Our method accomplishes this goal by exploring multiple solutions sampled from the solution space. We proved our method theoretically and demonstrated its usage on simulated data. We then applied our method to a real eQTL data and identified a set of alternative eQTLs with potential biological insights. Additionally, these alternative eQTLs implicate a network view of understanding gene regulation.
  • 机译 评估,比较和解释蛋白质域层次结构
    • 作者:Andrew F. Neuwald
    • 刊名:Journal of Computational Biology
    • 2014年第4期
    摘要:Arranging protein domain sequences hierarchically into evolutionarily divergent subgroups is important for investigating evolutionary history, for speeding up web-based similarity searches, for identifying sequence determinants of protein function, and for genome annotation. However, whether or not a particular hierarchy is optimal is often unclear, and independently constructed hierarchies for the same domain can often differ significantly. This article describes methods for statistically evaluating specific aspects of a hierarchy, for probing the criteria underlying its construction and for direct comparisons between hierarchies. Information theoretical notions are used to quantify the contributions of specific hierarchical features to the underlying statistical model. Such features include subhierarchies, sequence subgroups, individual sequences, and subgroup-associated signature patterns. Underlying properties are graphically displayed in plots of each specific feature's contributions, in heat maps of pattern residue conservation, in “contrast alignments,” and through cross-mapping of subgroups between hierarchies. Together, these approaches provide a deeper understanding of protein domain functional divergence, reveal uncertainties caused by inconsistent patterns of sequence conservation, and help resolve conflicts between competing hierarchies.
  • 机译 并行连续流:用于全基因组的并行后缀树构建工具
    摘要:The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex everyday, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.
  • 机译 在关联分析中利用基因组结构
    摘要:A genome-wide association study involves examining a large number of single-nucleotide polymorphisms (SNPs) to identify SNPs that are significantly associated with the given phenotype, while trying to reduce the false positive rate. Although haplotype-based association methods have been proposed to accommodate correlation information across nearby SNPs that are in linkage disequilibrium, none of these methods directly incorporated the structural information such as recombination events along chromosome. In this paper, we propose a new approach called stochastic block lasso for association mapping that exploits prior knowledge on linkage disequilibrium structure in the genome such as recombination rates and distances between adjacent SNPs in order to increase the power of detecting true associations while reducing false positives. Following a typical linear regression framework with the genotypes as inputs and the phenotype as output, our proposed method employs a sparsity-enforcing Laplacian prior for the regression coefficients, augmented by a first-order Markov process along the sequence of SNPs that incorporates the prior information on the linkage disequilibrium structure. The Markov-chain prior models the structural dependencies between a pair of adjacent SNPs, and allows us to look for association SNPs in a coupled manner, combining strength from multiple nearby SNPs. Our results on HapMap-simulated datasets and mouse datasets show that there is a significant advantage in incorporating the prior knowledge on linkage disequilibrium structure for marker identification under whole-genome association.
  • 机译 贝叶斯采样器,用于优化蛋白质结构域层次结构
    • 作者:Andrew F. Neuwald
    • 刊名:Journal of Computational Biology
    • 2014年第3期
    摘要:The process of identifying and modeling functionally divergent subgroups for a specific protein domain class and arranging these subgroups hierarchically has, thus far, largely been done via manual curation. How to accomplish this automatically and optimally is an unsolved statistical and algorithmic problem that is addressed here via Markov chain Monte Carlo sampling. Taking as input a (typically very large) multiple-sequence alignment, the sampler creates and optimizes a hierarchy by adding and deleting leaf nodes, by moving nodes and subtrees up and down the hierarchy, by inserting or deleting internal nodes, and by redefining the sequences and conserved patterns associated with each node. All such operations are based on a probability distribution that models the conserved and divergent patterns defining each subgroup. When we view these patterns as sequence determinants of protein function, each node or subtree in such a hierarchy corresponds to a subgroup of sequences with similar biological properties. The sampler can be applied either de novo or to an existing hierarchy. When applied to 60 protein domains from multiple starting points in this way, it converged on similar solutions with nearly identical log-likelihood ratio scores, suggesting that it typically finds the optimal peak in the posterior probability distribution. Similarities and differences between independently generated, nearly optimal hierarchies for a given domain help distinguish robust from statistically uncertain features. Thus, a future application of the sampler is to provide confidence measures for various features of a domain hierarchy.

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号