Journal of Computational Biology

Journal information

  • Journal name: Journal of Computational Biology

  • Frequency: Monthly, 2009-
385 results
  • Immunoinformatics approach to design a novel epitope-based oral vaccine against Helicobacter pylori
    Abstract: Helicobacter pylori is an infectious agent that colonizes the gastric mucosa of half of the population worldwide. This bacterium has been recognized as a group 1 carcinogen by the World Health Organization for its role in the development of gastritis, peptic ulcers, and cancer. Due to the increase in resistance to antibiotics used in anti-H. pylori therapy, the development of an effective vaccine is an alternative of great interest, which remains a challenge. Therefore, a rational, strategic, and efficient vaccine design against H. pylori is necessary, and the most current bioinformatics tools could help achieve it. In this study, an immunoinformatics approach was used to design a novel multiepitope oral vaccine against H. pylori. Our multiepitope vaccine is composed of cholera toxin subunit B (CTB), which is used as a mucosal adjuvant to enhance vaccine immunogenicity for oral immunization. CTB is fused to 11 epitopes predicted from pathogenic (UreB170–189, VacA459–478, CagA1103–1122, GGT106–126, NapA30–44, and OipA211–230) and colonization (HpaA33–52, FlaA487–506, FecA437–456, BabA129–149, and SabA540–559) proteins of H. pylori. The CKS9 peptide (CKSTHPLSC) targets epithelial microfold cells to enhance vaccine uptake from the gut barrier. All sequences were joined to each other by proper linkers. The vaccine was modeled and validated to achieve a high-quality three-dimensional structure. The vaccine design was evaluated as nonallergenic, antigenic, and soluble, with an appropriate molecular weight and isoelectric point. Our results suggest that our newly designed vaccine could serve as a promising anti-H. pylori vaccine candidate.
  • ALPHLARD-NT: a Bayesian method for human leukocyte antigen genotyping and mutation calling through simultaneous analysis of normal and tumor whole-genome sequence data
    Abstract: Human leukocyte antigen (HLA) genes provide useful information on the relationship between cancer and the immune system. Despite the ease of obtaining these data through next-generation sequencing methods, interpretation of these relationships remains challenging owing to the complexity of HLA genes. To resolve this issue, we developed a Bayesian method, ALPHLARD-NT, to identify HLA germline and somatic mutations as well as HLA genotypes from whole-exome sequencing (WES) and whole-genome sequencing (WGS) data. ALPHLARD-NT showed 99.2% accuracy for WGS-based HLA genotyping and detected five HLA somatic mutations in 25 colon cancer cases. In addition, ALPHLARD-NT identified 88 HLA somatic mutations, including recurrent mutations and a novel HLA-B type, from WES data of 343 colon adenocarcinoma cases. These results demonstrate the potential of ALPHLARD-NT for conducting an accurate analysis of HLA genes even from low-coverage data sets. This method can become an essential tool for comprehensive analyses of HLA genes from WES and WGS data, helping to advance understanding of immune regulation in cancer as well as providing guidance for novel immunotherapy strategies.
  • Eric Davidson's regulatory genome for computer science: causality, logic, and proof principles of the genomic cis-regulatory code
    • Author: Sorin Istrail
    • Journal: Journal of Computational Biology
    • 2019, Issue 7
    Abstract: In this article, we discuss several computer science problems, inspired by our 15-year-long collaboration with Prof. Eric Davidson, focusing on computer science contributions to the study of the regulatory genome. Our joint study was inspired by his lifetime trailblazing research program rooted in causal gene regulatory networks (GRNs), system completeness, genomic Boolean logic, and genomically encoded regulatory information. We first present four inspiring questions that Eric Davidson asked, and the follow-up, namely, seven technical problems, fully or partially resolved with the methods of computer science. At the center, and unifying the intellectual backbone of those technical challenges, stands “Causality.” Our collaboration produced the causality-inferred cisGRN-Lexicon database, containing the cis-regulatory architecture (CRA) of 600+ transcription factor (TF)-encoding genes and other regulatory genes, in eight species: human, mouse, fruit fly, sea urchin, nematode, rat, chicken, and zebrafish. These CRAs are causality-inferred regulatory regions of genes, derived experimentally through the experimental method called “cis-regulatory analysis” (also known as the “Davidson criteria”). In this research program, causality challenges for computer science show up in two components: (1) how to define data structures that represent the DNA structure data causality-inferred by the Davidson criteria, and how to define a versatile software system to host them; and (2) how to identify, by automated software for text analysis, the experimental technical articles applying the Davidson criteria to the analysis of genes. We next present the cisGRN-Lexicon Meta-Analysis (Part I).
We conclude the article with some reflections on epistemology and philosophy themes concerning the role of causality, logic, and proof in the emerging elegant mathematical theory and practice of the regulatory genome. It is challenging to explain what “explanation” is, and to understand what “understanding” is, when the technical task is to “prove” system-level causality completeness of a 50-gene causal GRN. Within the Peter-Davidson Boolean GRN model, the Peter-Davidson completeness “theorem” provides a seminal answer: Experimental causality system completeness = Computational exact prediction completeness. The article is organized as follows. Section 2 is dedicated to our Prof. Eric Davidson. Section 3 gives a brief introduction for computer scientists to the regulatory genome and its information processing operations in terms similar to the electronic computer. Section 4 proposes to honor Eric Davidson's life-long scientific work on the regulatory genome by naming a most fundamental time unit constant after him. Section 5 presents four grand challenge questions that Eric Davidson asked, and seven follow-up problems inspired by the first two questions, which we fully or partially solved together. Central to the mentioned solutions is our construction of the cisGRN-Lexicon, the database of causally inferred CRA of 600+ regulatory genes in eight species. Section 6 presents Part I of the cisGRN-Lexicon Meta-Analysis, couched as “rules” of the genomic cis-regulatory code. Section 7 is devoted to reflections on epistemological and philosophical themes: causality, logic, and proof in the elegant mathematical modeling of the regulatory genome. We present here the “Davidsonian Causal Systems Biology Axioms,” which guide us toward understanding of the meaning of “proving” causality completeness, for a complex experimental system, by exact computational predictions.
  • INDEX-db: the Indian exome reference database (Phase I)
    Abstract: Deep sequencing-based genetic mapping has greatly enhanced the ability to catalog variants with plausible disease association. Confirming how these identified variants contribute to specific disease conditions, across human populations, poses the next challenge. Differential selection pressure may impact the frequency of genetic variations, and thus detection of association with disease conditions, across populations. To understand genotype to phenotype correlations, it thus becomes important to first understand the spectrum of genetic variation within a population by creating a reference map. In this study, we report the development of phase I of a new database of genetic variations called INDian EXome database (INDEX-db), from the Indian population, with an aim to establish a centralized database of integrated information. This could be useful for researchers involved in studying disease mechanisms at clinical, genetic, and cellular levels.
  • Multivariable analysis of data sets with missing values: an information theory-based reliability function
    Abstract: Missing values in complex biological data sets have significant impacts on our ability to correctly detect and quantify interactions in biological systems and to infer relationships accurately. In this article, we propose a useful metaphor to show that information theory measures, such as mutual information and interaction information, can be employed directly for evaluating multivariable dependencies even if data contain some missing values. The metaphor is that of thinking of variable dependencies as information channels between and among variables. In this view, missing data can be thought of as noise that reduces the channel capacity in predictable ways. We extract the available information in the data even if there are missing values and use the notion of channel capacity to assess the reliability of the result. This avoids the common practice (in the absence of prior knowledge of random imputation) of eliminating samples entirely, thus losing the information they can provide. We show how this reliability function can be implemented for pairs of variables, and generalize it for an arbitrary number of variables. Illustrations of the reliability functions for several cases are provided using simulated data.
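    As a rough illustration of the idea in the abstract above (using the samples that are available rather than discarding them), the sketch below estimates mutual information for one pair of discrete variables from only the positions where both values are observed, and reports the fraction of positions that were usable. It is not the reliability function from the article; the function name, the use of `None` to mark a missing value, and the plug-in estimator are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information_observed(xs, ys):
    """Plug-in mutual information (in bits) over jointly observed pairs.

    `None` marks a missing value (an assumption of this sketch).
    Returns (mi_bits, fraction_of_positions_used).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0, 0.0

    joint = Counter(pairs)
    marg_x = Counter(x for x, _ in pairs)
    marg_y = Counter(y for _, y in pairs)

    # I(X;Y) = sum_xy p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    mi = sum(
        (c / n) * log2(c * n / (marg_x[x] * marg_y[y]))
        for (x, y), c in joint.items()
    )
    return mi, n / len(xs)


if __name__ == "__main__":
    xs = [0, 1, 0, 1, None, 1]
    ys = [0, 1, 0, 1, 1, None]
    print(mutual_information_observed(xs, ys))
```

    The second return value is only a crude indicator of how much data supported the estimate; the article's channel-capacity argument is more refined than this.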
  • From Gregor Mendel to Eric Davidson: mathematical models and basic principles in biology
    • Author: Ute Deichmann
    • Journal: Journal of Computational Biology
    • 2019, Issue 7
    Abstract: Mathematical models have been widespread in biology since its emergence as a modern experimental science in the 19th century. Focusing on models in developmental biology and heredity, this article (1) presents the properties and epistemological basis of pertinent mathematical models in biology from Mendel's model of heredity in the 19th century to Eric Davidson's model of developmental gene regulatory networks in the 21st; (2) shows that the models differ not only in their epistemologies but also in regard to explicitly or implicitly taking into account basic biological principles, in particular those of biological specificity (that became, in part, replaced by genetic information) and genetic causality. The article claims that models disregarding these principles did not impact the direction of biological research in a lasting way, although some of them, such as D'Arcy Thompson's models of biological form, were widely read and admired and others, such as Turing's models of development, stimulated research in other fields. Moreover, it suggests that successful models were not purely mathematical descriptions or simulations of biological phenomena but were based on inductive, as well as hypothetico-deductive, methodology. The recent availability of large amounts of sequencing data and new computational methodology tremendously facilitates system approaches and pattern recognition in many fields of research. Although these new technologies have given rise to claims that correlation is replacing experimentation and causal analysis, the article argues that the inductive and hypothetico-deductive experimental methodologies have remained fundamentally important as long as causal-mechanistic explanations of complex systems are pursued.
  • Mendelian-inconsistent signatures from 1314 ethnically diverse family trios distinguish biological variation from sequencing error
    Abstract: Next-generation sequencing enables advances in the clinical application of genomics by providing high-throughput detection of genomic variation. However, next-generation sequencing technologies, especially whole-genome sequencing (WGS), are often associated with a high false-positive rate. Trio-based WGS can contribute significantly towards improved quality control methods. Mendelian-inconsistent calls (MIC) in parent–child trios are commonly attributed to erroneous sequencing calls, as the true de novo mutation rate is extremely low compared with MIC incidence. Here, we analyzed WGS data from 1314 mother, father, and child trios across ethnically diverse populations with the goal of characterizing MIC. Genotype calls in a trio can be used to assign different signatures to MIC. MIC occur more frequently within repeats but show varying distribution and error mechanisms across repeat types. MIC are enriched within poly-A/T runs in short interspersed nuclear elements. Alignability scores, allele balance, and relative parental read depth vary among MIC signatures and these differences should be considered when designing filters for MIC reduction. MIC cluster in germline deletions and these MIC also segregate with population. Our results provide a basis for making decisions on how each MIC type should be evaluated before discarding them as errors or including them in alternative applications. With the reduction of sequencing cost, family trio whole genome and exome analysis are being performed more routinely in clinical practice. We provide a reference that can be used for annotating MIC with their frequencies in a larger population to aid in the filtering of candidate de novo mutations.
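    To make the notion of a Mendelian-inconsistent call concrete, here is a minimal sketch of the basic consistency check for a single autosomal, diploid site. The genotype encoding (a pair of allele indices, 0 for reference and 1 for alternate) and the function name are assumptions for illustration; the article's signature assignment and filtering are not reproduced here.

```python
def is_mendelian_consistent(mother, father, child):
    """Return True if the child's genotype can be formed by taking one
    allele from the mother and one from the father.

    Each genotype is a pair of allele indices, e.g. (0, 1) for a
    heterozygote. Assumes an autosomal diploid site (sketch only).
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


if __name__ == "__main__":
    # Both parents homozygous reference, child heterozygous: inconsistent
    # (either a de novo mutation or, far more often, a calling error).
    print(is_mendelian_consistent((0, 0), (0, 0), (0, 1)))
    # Heterozygous mother, homozygous-alternate father, child homozygous alternate.
    print(is_mendelian_consistent((0, 1), (1, 1), (1, 1)))
```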
  • Directional association measurement in contingency tables: a genomic case
    Abstract: Analysis of large data sets is currently a major challenge. Strong efforts are being undertaken to tackle this problem by developing new methods or modifying existing ones. The Z association method is a new method for describing directional association in contingency tables. It allows the categories of each of the two variables for which the contingency table is analyzed to be grouped arbitrarily. The Z coefficient was calculated on a sample data set with gene mutations in different cancer types. Results showed some association with both gene mutations and annotation groups. Detailed results obtained for particular cancer types versus particular genes and annotation groups were in line with well-known facts in cancer genomics. The “MEUSassociation” R library supports analysis of the directional association between two categorical variables, whose mutual relationship is summarized in a contingency table, by means of the Z association coefficient. The method implemented in the library computes the standard Z coefficient and can also compute all possible singular coefficients Z(A:B) at the same time, giving information on the association between particular rows and columns. Investigating the ranked list of the highest singular coefficients makes it possible to reduce the complexity of a large-scale data set. Both the Z coefficient and its R implementation are important tools in categorical data analysis.
  • Joker de Bruijn: covering k-mers using joker characters
    Abstract: Sequence libraries that cover all k-mers enable universal and unbiased measurements of nucleotide and peptide binding. The shortest sequence to cover all k-mers is a de Bruijn sequence of length |Σ|^k, where Σ is the alphabet. Researchers would like to increase k to measure interactions at greater detail, but face a challenging problem: the number of k-mers grows exponentially in k, while the space on the experimental device is limited. In this study, we introduce a novel advance to shrink k-mer library sizes by using joker characters, which represent all characters in the alphabet. Theoretically, the use of joker characters can reduce the library size tremendously, but it should be limited as the introduced degeneracy lowers the statistical robustness of measurements. In this work, we consider the problem of generating a minimum-length sequence that covers a given set of k-mers using joker characters. The number and positions of the joker characters are provided as input. We first prove that the problem is NP-hard. We then present the first solution to the problem, which is based on two algorithmic innovations: (1) a greedy heuristic and (2) an integer linear programming (ILP) formulation. We first run the heuristic to find a good feasible solution, and then run an ILP solver to improve it. We ran our algorithm on DNA and amino acid alphabets to cover all k-mers for different values of k and k-mer multiplicity. Results demonstrate that it produces sequences that are very close to the theoretical lower bound.
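    For readers unfamiliar with the baseline that the abstract above starts from, the following sketch builds an ordinary (joker-free) cyclic de Bruijn sequence with the standard recursive Lyndon-word construction and checks that it covers every k-mer. It does not implement the article's joker-character heuristic or ILP; the function names are chosen for this example only.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Cyclic de Bruijn sequence of order k over `alphabet`.

    Standard construction: concatenate, in lexicographic order, the
    Lyndon words whose length divides k.
    """
    n = len(alphabet)
    a = [0] * (k + 1)
    out = []

    def extend(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
            return
        a[t] = a[t - p]
        extend(t + 1, p)
        for j in range(a[t - p] + 1, n):
            a[t] = j
            extend(t + 1, t)

    extend(1, 1)
    return "".join(alphabet[i] for i in out)


def covers_all_kmers(seq: str, alphabet: str, k: int) -> bool:
    """Check that every k-mer occurs when `seq` is read cyclically."""
    wrapped = seq + seq[:k - 1]
    seen = {wrapped[i:i + k] for i in range(len(seq))}
    return len(seen) == len(alphabet) ** k


if __name__ == "__main__":
    s = de_bruijn("ACGT", 3)
    print(len(s), covers_all_kmers(s, "ACGT", 3))
```

    Read as a cycle, the sequence has length 4^3 = 64 for DNA 3-mers; a linear library would need k - 1 extra characters to close the wrap-around.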
  • Computational analysis of cell dynamics in videos with hierarchical-pooled deep-convolutional features
    Abstract: Computational analysis of cellular appearance and its dynamics is used to investigate physiological properties of cells in biomedical research. In consideration of the great success of deep learning in video analysis, we first introduce two-stream convolutional networks (ConvNets) to automatically learn the biologically meaningful dynamics from raw live-cell videos. However, the two-stream ConvNets lack the ability to capture long-range video evolution. Therefore, a novel hierarchical pooling strategy is proposed to model the cell dynamics in a whole video, which is composed of trajectory pooling for short-term dynamics and rank pooling for long-range ones. Experimental results demonstrate that the proposed pipeline effectively captures the spatiotemporal dynamics from the raw live-cell videos and outperforms existing methods on our cell video database.
  • Superbubbles, ultrabubbles, and cacti
    Abstract: A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats [e.g., variant call format (VCF)].
  • A flow procedure for linearization of genome sequence graphs
    Abstract: Efforts to incorporate human genetic variation into the reference human genome have converged on the idea of a graph representation of genetic variation within a species, a genome sequence graph. A sequence graph represents a set of individual haploid reference genomes as paths in a single graph. When that set of reference genomes is sufficiently diverse, the sequence graph implicitly contains all frequent human genetic variations, including translocations, inversions, deletions, and insertions. In representing a set of genomes as a sequence graph, one encounters certain challenges. One of the most important is the problem of graph linearization, essential both for efficiency of storage and access, and for natural graph visualization and compatibility with other tools. The goal of graph linearization is to order nodes of the graph in such a way that operations such as access, traversal, and visualization are as efficient and effective as possible. A new algorithm for the linearization of sequence graphs, called the flow procedure (FP), is proposed in this article. Comparative experimental evaluation of the FP against other algorithms shows that it outperforms its rivals in the metrics most relevant to sequence graphs.
  • Improved search of large transcriptomic sequencing databases using split sequence bloom trees
    Abstract: Enormous databases of short-read RNA-seq experiments such as the NIH Sequence Read Archive are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
  • Towards recovering allele-specific cancer genome graphs
    Abstract: Integrated analysis of structural variants (SVs) and copy number alterations in aneuploid cancer genomes is key to understanding tumor genome complexity. A recently developed algorithm, Weaver, can estimate, for the first time, allele-specific copy number of SVs and their interconnectivity in aneuploid cancer genomes. However, one major limitation is that not all SVs identified by Weaver are phased. In this article, we develop a general convex programming framework that predicts the interconnectivity of unphased SVs with possibly noisy allele-specific copy number estimations as input. We demonstrated through applications to both simulated data and HeLa whole-genome sequencing data that our method is robust to the noise in the input copy numbers and can predict SV phasings with high specificity. We found that our method can make consistent predictions with Weaver even if a large proportion of the input variants are unphased. We also applied our method to The Cancer Genome Atlas (TCGA) ovarian cancer whole-genome sequencing samples to phase SVs left unphased by Weaver. Our work provides an important new algorithmic framework for recovering more complete allele-specific cancer genome graphs.
  • To t-test or not to t-test? A p-values-based point of view in the receiver operating characteristic curve framework
    Abstract: A common statistical doctrine supported by many introductory courses and textbooks is that t-test type procedures based on normally distributed data points are anticipated to provide a standard in decision-making. In order to motivate scholars to examine this convention, we introduce a simple approach based on graphical tools of receiver operating characteristic (ROC) curve analysis, a well-established biostatistical methodology. In this context, we propose employing a p-values-based method, taking into account the stochastic nature of p-values. We focus on the modern statistical literature to address the expected p-value (EPV) as a measure of the performance of decision-making rules. During the course of our study, we extend the EPV concept to be considered in terms of the ROC curve technique. This provides expressive evaluations and visualizations of a wide spectrum of testing mechanisms' properties. We show that the conventional power characterization of tests is a partial aspect of the presented EPV/ROC technique. We hope that this explanation convinces researchers of the usefulness of the EPV/ROC approach for depicting different characteristics of decision-making procedures, in light of the growing interest regarding correct p-values-based applications.
  • Simple comparative analyses of differentially expressed gene lists may overestimate gene overlap
    Abstract: Comparing the overlap between sets of differentially expressed genes (DEGs) within or between transcriptome studies is regularly used to infer similarities between biological processes. Significant overlap between two sets of DEGs is usually determined by a simple test. The number of potentially overlapping genes is compared to the number of genes that actually occur in both lists, treating every gene as equal. However, gene expression is controlled by transcription factors that bind to a variable number of transcription factor binding sites, leading to variation among genes in general variability of their expression. Neglecting this variability could therefore lead to inflated estimates of significant overlap between DEG lists. With computer simulations, we demonstrate that such biases arise from variation in the control of gene expression. Significant overlap commonly arises between two lists of DEGs that are randomly generated, assuming that the control of gene expression is variable among genes but consistent between corresponding experiments. More overlap is observed when transcription factors are specific to their binding sites and when the number of genes is considerably higher than the number of different transcription factors. In contrast, overlap between two DEG lists is always lower than expected when the genetic architecture of expression is independent between the two experiments. Thus, the current methods for determining significant overlap between DEGs are potentially confounding biologically meaningful overlap with overlap that arises due to variability in control of expression among genes, and more sophisticated approaches are needed.
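    The abstract above does not name the "simple test"; a common choice that treats every gene as equal is the hypergeometric tail probability, sketched below as an assumption. The function and parameter names are illustrative. The article's point is that this null model (every gene equally likely to appear in a DEG list) can make overlaps look more significant than they are.

```python
from math import comb


def overlap_p_value(n_genes: int, size_a: int, size_b: int, overlap: int) -> float:
    """P(X >= overlap) under a hypergeometric null.

    List B (size_b genes) is assumed to be drawn uniformly without
    replacement from n_genes, of which size_a belong to list A.
    """
    upper = min(size_a, size_b)
    total = comb(n_genes, size_b)
    tail = sum(
        comb(size_a, x) * comb(n_genes - size_a, size_b - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 20000 genes, two DEG lists of 500 and 400 genes sharing 30 genes.
    print(overlap_p_value(20000, 500, 400, 30))
```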
  • POPSTR: inference of admixed population structure based on single-nucleotide polymorphisms and copy number variations
    Abstract: Statistical approaches for population structure estimation have been predominantly driven by a particular data type, single-nucleotide polymorphisms (SNPs). However, in the presence of weak identifiability in SNPs, population structure estimation can suffer from undesirable accuracy loss. Copy number variations (CNVs) are genomic structural variants with loci that are commonly shared within a specific population and thus provide valuable information for estimation of the ancestry of sampled populations. We develop a Bayesian joint modeling framework of SNPs and CNVs, called POPSTR, to better understand population structure than approaches that use SNPs solely. To deal with the increased data volume, we use the Metropolis Adjusted Langevin algorithm (MALA) that guides the target distribution in a computationally efficient way. We illustrate applications of our approach using the HapMap 2005 project data. We carry out simulation studies and show that the performance of our approach is comparable to or better than that of popular benchmarks, STRUCTURE and ADMIXTURE. We also observe that using only CNVs can be remarkably efficient if SNP data are not available.
  • Survival forests with R-squared splitting rules
    Abstract: In modeling censored data, survival forest models are a competitive nonparametric alternative to traditional parametric or semiparametric models when the functional forms are possibly misspecified or the underlying assumptions are violated. In this work, we propose a survival forest approach with trees constructed using novel pseudo-R² splitting rules. By studying well-known benchmark data sets, we find that the proposed model generally outperforms popular survival models such as random survival forest with different splitting rules, Cox proportional hazard model, and generalized boosted model in terms of the C-index metric.
  • Initial cluster analysis
    Abstract: We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1, these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.
  • Symmetrical parameterization of rigid body transformations for biomolecular structures
    Abstract: Assessing preferred relative rigid body position and orientation is important in the description of biomolecular structures (such as proteins) and their interactions. In this article, we extend and apply the “symmetrical parameterization,” which we recently introduced in the kinematics community, to address problems in structural biology. We also review parameterization methods that are widely used in structural biology to describe relative rigid body motions (in particular, orientations) as a basis for comparison. The new symmetrical parameterization is useful in describing the relative biomolecular rigid body motions, where the parameters are symmetrical in the sense that the subunits of a complex biomolecular structure are described in the same way for the corresponding motion and its inverse. The properties of this new parameterization, singularity analysis, and inverse kinematics are also investigated in more detail. Finally, the parameterization is applied to real biomolecular structures, and a potential application to structure modeling of symmetric macromolecules is presented, to show the efficacy of the symmetrical parameterization in the field of computational structural biology.
