Journal information

  • Journal title: Systematic Biology
  • Frequency: Six no. a year, 2001-
  • NLM title: Syst Biol
  • ISO abbreviation: -
  • ISSN: -

  • Swapping Birth and Death: Symmetries and Transformations in Phylodynamic Models
    Abstract: Stochastic birth–death models provide the foundation for studying and simulating evolutionary trees in phylodynamics. A curious feature of such models is that they exhibit fundamental symmetries when the birth and death rates are interchanged. In this article, we first provide intuitive reasons for these known transformational symmetries. We then show that these transformational symmetries (encoded in algebraic identities) are preserved even when individuals at the present are sampled with some probability. However, these extended symmetries require the death rate parameter to sometimes take a negative value. In the last part of this article, we describe the relevance of these transformations and their application to computational phylodynamics, particularly to maximum likelihood and Bayesian inference methods, as well as to model selection.
  • Metapopulation Vicariance, Age of Island Taxa and Dispersal: A Case Study Using the Pacific Plant Genus Planchonella (Sapotaceae)
    Abstract: Oceanic islands originate from volcanism or tectonic activity without connections to continental landmasses, are colonized by organisms, and eventually vanish due to erosion and subsidence. Colonization of oceanic islands occurs through long-distance dispersals (LDDs) or metapopulation vicariance, the latter resulting in lineages being older than the islands they inhabit. If metapopulation vicariance is valid, island ages cannot be reliably used to provide maximum age constraints for molecular dating. We explore the relationships between the ages of members of a widespread plant genus (Planchonella, Sapotaceae) and their host islands across the Pacific to test various assumptions of dispersal and metapopulation vicariance. We sampled three nuclear DNA markers from 156 accessions representing some 100 Sapotaceae taxa, and analyzed these in BEAST with a relaxed clock to estimate divergence times and with a phylogeographic diffusion model to estimate range expansions over time. The phylogeny was calibrated with a secondary point (the root) and fossils from New Zealand. The dated phylogeny reveals that the ages of Planchonella species are, in most cases, consistent with the ages of the islands they inhabit. Planchonella is inferred to have originated in the Sahul Shelf region, to which it back-dispersed multiple times. Fiji has been an important source for range expansion in the Pacific for the past 23 myr. Our analyses reject metapopulation vicariance in all cases tested, including between oceanic islands, evolution of an endemic Fiji–Vanuatu flora, and westward rollback vicariance between Vanuatu and the Loyalty Islands. Repeated dispersal is the only mechanism able to explain the empirical data. The longest (8900 km) identified dispersal is between Palau in the Pacific and the Seychelles in the Indian Ocean, estimated at 2.2 Ma (0.4–4.8 Ma). The first split in a Hawaiian lineage (P. sandwicensis) matches the age of Necker Island (11.0 Ma), when its ancestor diverged into two species that are distinguished by purple and yellow fruits. Subsequent establishment across the Hawaiian archipelago supports, in part, progression rule colonization. In summary, we found no explanatory power in metapopulation vicariance and conclude that Planchonella has expanded its range across the Pacific by LDD. We contend that this will be seen in many other groups when analyzed in detail.
  • The Search for Common Origin: Homology Revisited
    Abstract: Understanding the evolution of biodiversity on Earth is a central aim in biology. Currently, various disciplines of science contribute to unravel evolution at all levels of life, from individual organisms to species and higher ranks, using different approaches and specific terminologies. The search for common origin, traditionally called homology, is a connecting paradigm of all studies related to evolution. However, it is not always sufficiently taken into account that defining homology depends on the hierarchical level studied (organism, population, and species), which can cause confusion. Therefore, we propose a framework to define homologies making use of existing terms, which refer to homology in different fields, but restricting them to an unambiguous meaning and a particular hierarchical level. We propose to use the overarching term “homology” only when “morphological homology,” “vertical gene transfer,” and “phylogenetic homology” are confirmed. Consequently, neither phylogenetic nor morphological homology is equal to homology. This article is intended for readers with different research backgrounds. We challenge their traditional approaches, inviting them to consider the proposed framework and offering them a new perspective for their own research.
  • A Third Strike Against Perfect Phylogeny
    Abstract: Perfect phylogenies are fundamental in the study of evolutionary trees because they capture the situation when each evolutionary trait emerges only once in history; if such events are believed to be rare, then by Occam’s Razor such parsimonious trees are preferable as a hypothesis of evolution. A classical result states that 2-state characters permit a perfect phylogeny precisely if each subset of 2 characters permits one. More recently, it was shown that for 3-state characters the same property holds but for size-3 subsets. A long-standing open problem asked whether such a constant exists for each number of states. More precisely, it has been conjectured that for any fixed number of states r there exists a constant f(r) such that a set of r-state characters has a perfect phylogeny if and only if every subset of at most f(r) characters has a perfect phylogeny. Informally, the conjecture states that checking fixed-size subsets of characters is enough to correctly determine whether input data permits a perfect phylogeny, irrespective of the number of characters in the input. In this article, we show that this conjecture is false. In particular, we show that for any constant t, there exists a set C of characters, over a fixed number of states, such that C has no perfect phylogeny, but there exists a perfect phylogeny for every subset of at most t characters. Moreover, there already exists a perfect phylogeny when ignoring just one of the characters, independent of which character you ignore. This negative result complements two earlier negative results (“strikes”). We reflect on the consequences of this third strike, pointing out that while it does close off some routes for efficient algorithm development, many others remain open.
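
The classical 2-state result mentioned in this abstract can be illustrated with a pairwise check. The sketch below is not taken from the article; it assumes a taxa-by-characters matrix of 0/1 states (an input format chosen here for illustration) and uses the textbook four-gamete condition: two binary characters are compatible unless all four state combinations (0,0), (0,1), (1,0), (1,1) occur among the taxa. The function names are hypothetical.

```python
from itertools import combinations


def pair_is_compatible(col_a, col_b):
    """Two binary characters conflict only if all four state pairs occur."""
    return len(set(zip(col_a, col_b))) < 4


def admits_perfect_phylogeny(matrix):
    """matrix[i][j] is the state (0 or 1) of character j in taxon i.

    For 2-state characters, checking every pair of characters is enough.
    """
    columns = list(zip(*matrix))
    return all(pair_is_compatible(a, b) for a, b in combinations(columns, 2))


# Toy data (invented for this example): rows are taxa, columns are characters.
compatible = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
conflicting = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
]

print(admits_perfect_phylogeny(compatible))
print(admits_perfect_phylogeny(conflicting))
```

The article's negative result concerns characters with more states, where, according to the abstract, no fixed subset size of this kind suffices.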
  • Fossils with Feathers and Philosophy of Science
    Abstract: The last half century of paleornithological research has transformed the way that biologists perceive the evolutionary history of birds. This transformation has been driven, since 1969, by a series of exciting fossil discoveries combined with intense scientific debate over how best to interpret these discoveries. Ideally, as evidence accrues and results accumulate, interpretive scientific agreement forms. But this has not entirely happened in the debate over avian origins: the accumulation of scientific evidence and analyses has had some effect, but not a conclusive one, in terms of resolving the question of avian origins. Although the majority of biologists have come to accept that birds are dinosaurs, there is lingering and, in some quarters, strident opposition to this view. In order to both understand the ongoing disagreement about avian origins and generate a prediction about the future of the debate, here we use a revised model of scientific practice to assess the current and historical state of play surrounding the topic of bird evolutionary origins. Many scientists are familiar with the metascientific scholars Sir Karl Popper and Thomas Kuhn, and these are the primary figures that have been appealed to so far, in prior attempts to assess the dispute. But we demonstrate that a variation of Imre Lakatos’s model of progressive versus degenerative research programmes provides a novel and productive assessment of the debate. We establish that a refurbished Lakatosian account both explains the intractability of the dispute and predicts a likely outcome for the debate about avian origins. In short, here, we offer a metascientific tool for rationally assessing competing theories, one that allows researchers involved in seemingly intractable scientific disputes to advance their debates.
  • Integration of Anatomy Ontologies and Evo-Devo Using Structured Markov Models Suggests a New Framework for Modeling Discrete Phenotypic Traits
    Abstract: Modeling discrete phenotypic traits for either ancestral character state reconstruction or morphology-based phylogenetic inference suffers from ambiguities of character coding, homology assessment, dependencies, and selection of adequate models. These drawbacks occur because trait evolution is driven by two key processes, hierarchical and hidden, which are not accommodated simultaneously by the available phylogenetic methods. The hierarchical process refers to the dependencies between anatomical body parts, while the hidden process refers to the evolution of gene regulatory networks (GRNs) underlying trait development. Herein, I demonstrate that these processes can be efficiently modeled using structured Markov models (SMM) equipped with hidden states, which resolves the majority of the problems associated with discrete traits. Integration of SMM with anatomy ontologies can adequately incorporate the hierarchical dependencies, while the use of the hidden states accommodates hidden evolution of GRNs and substitution rate heterogeneity. I assess the new models using simulations and theoretical synthesis. The new approach solves the long-standing “tail color problem,” in which the trait is scored for species with tails of different colors or no tails. It also presents a previously unknown issue called the “two-scientist paradox,” in which the nature of coding the trait and the hidden processes driving the trait’s evolution are confounded; failing to account for the hidden process may result in a bias, which can be avoided by using hidden state models. All this provides a clear guideline for coding traits into characters. This article gives practical examples of using the new framework for phylogenetic inference and comparative analysis.
  • An Algorithm for Morphological Phylogenetic Analysis with Inapplicable Data
    Abstract: Morphological data play a key role in the inference of biological relationships and evolutionary history and are essential for the interpretation of the fossil record. The hierarchical interdependence of many morphological characters, however, complicates phylogenetic analysis. In particular, many characters only apply to a subset of terminal taxa. The widely used “reductive coding” approach treats taxa in which a character is inapplicable as though the character’s state is simply missing (unknown). This approach has long been known to create spurious tree length estimates on certain topologies, potentially leading to erroneous results in phylogenetic searches, but practical solutions have yet to be proposed and implemented. Here, we present a single-character algorithm for reconstructing ancestral states in reductively coded data sets, following the theoretical guideline of minimizing homoplasy over all characters. Our algorithm uses up to three traversals to score a tree, and a fourth to fully resolve final states at each node within the tree. We use explicit criteria to resolve ambiguity in applicable/inapplicable dichotomies, and to optimize missing data. So that it can be applied to single characters, the algorithm employs local optimization; as such, the method provides a fast but approximate inference of ancestral states and tree score. The application of our method to published morphological data sets indicates that, compared to traditional methods, it identifies different trees as “optimal.” As such, the use of our algorithm to handle inapplicable data may significantly alter the outcome of tree searches, modifying the inferred placement of living and fossil taxa and potentially leading to major differences in reconstructions of evolutionary history.
  • A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids Clustering
    Abstract: Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes, while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, 5–15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order groups of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set. Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order groups, including the entire angiosperm clade itself.
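
To make the idea of choosing representative sequences by k-medoids clustering concrete, here is a minimal sketch. It is not the authors' pipeline: the Hamming distance, the fixed `k`, the naive initialization (the first `k` items), and the toy sequences are all assumptions made for illustration. The procedure alternates between assigning each sequence to its nearest medoid and re-picking, inside each cluster, the member with the smallest total distance to the others.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(items, k, dist, max_iter=100):
    """Alternate assignment and medoid-update steps until the medoids stop changing."""
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))  # naive, deterministic start: the first k items
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest current medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: d[i][m])
            clusters[nearest].append(i)
        # Update step: within each cluster, pick the member with the
        # smallest total distance to the other members (keep the old
        # medoid if a cluster happens to be empty).
        new_medoids = sorted(
            min(members, key=lambda c: sum(d[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [items[m] for m in medoids]


# Toy sequences (invented): two obvious groups of three.
seqs = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
representatives = k_medoids(seqs, 2, hamming)
print(representatives)
```

Unlike k-means, the cluster centers returned here are actual input sequences, which is what makes the approach suitable for picking real sequences as representatives.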
  • Golden Orbweavers Ignore Biological Rules: Phylogenomic and Comparative Analyses Unravel a Complex Evolution of Sexual Size Dimorphism
    Abstract: Instances of sexual size dimorphism (SSD) provide the context for rigorous tests of biological rules of size evolution, such as Cope’s rule (phyletic size increase), Rensch’s rule (allometric patterns of male and female size), as well as male and female body size optima. In certain spider groups, such as the golden orbweavers (Nephilidae), extreme female-biased SSD (eSSD, female:male body length ≥2) is the norm. Nephilid genera construct webs of exaggerated proportions, which can be aerial, arboricolous, or intermediate (hybrid). First, we established the backbone phylogeny of Nephilidae using 367 anchored hybrid enrichment markers, then combined these data with classical markers for a reference species-level phylogeny. Second, we used the phylogeny to test Cope and Rensch’s rules, sex specific size optima, and the coevolution of web size, type, and features with female and male body size and their ratio, SSD. Male, but not female, size increases significantly over time, and refutes Cope’s rule. Allometric analyses reject the converse, Rensch’s rule. Male and female body sizes are uncorrelated. Female size evolution is random, but males evolve toward an optimum size (3.2–4.9 mm). Overall, female body size correlates positively with absolute web size. However, intermediate sized females build the largest webs (of the hybrid type), giant female Nephila and Trichonephila build smaller webs (of the aerial type), and the smallest females build the smallest webs (of the arboricolous type). We propose taxonomic changes based on the criteria of clade age, monophyly and exclusivity, classification information content, and diagnosability. Spider families, as currently defined, tend to be between 37 million years old and 98 million years old, and Nephilidae is estimated at 133 Ma (97–146), thus deserving family status. We, therefore, resurrect the family Nephilidae Simon 1894 that contains Clitaetra Simon 1889, the Cretaceous Geratonephila, Herennia Thorell 1877, Indoetra, new rank, Nephila Leach 1815, Nephilengys L. Koch 1872, Nephilingis Kuntner 2013, Palaeonephila Wunderlich 2004 from Tertiary Baltic amber, and Trichonephila, new rank. We propose the new clade Orbipurae to contain Araneidae Clerck 1757, Phonognathidae Simon 1894, new rank, and Nephilidae. Nephilid female gigantism is a phylogenetically ancient phenotype (over 100 Ma), as is eSSD, though their magnitudes vary by lineage.
  • Mitochondrial Genome Fragmentation Unites the Parasitic Lice of Eutherian Mammals
    Abstract: Organelle genome fragmentation has been found in a wide range of eukaryotic lineages; however, its use in phylogenetic reconstruction has not been demonstrated. We explored the use of mitochondrial (mt) genome fragmentation in resolving the controversial suborder-level phylogeny of parasitic lice (order Phthiraptera). There are approximately 5000 species of parasitic lice in four suborders (Amblycera, Ischnocera, Rhynchophthirina, and Anoplura), which infest mammals and birds. The phylogenetic relationships among these suborders are unresolved despite decades of studies. We sequenced the mt genomes of eight species of parasitic lice and compared them with 17 other species of parasitic lice sequenced previously. We found that the typical single-chromosome mt genome is retained in the lice of birds but fragmented into many minichromosomes in the lice of eutherian mammals. The shared derived feature of mt genome fragmentation unites the eutherian mammal lice of Ischnocera (family Trichodectidae) with Anoplura and Rhynchophthirina to the exclusion of the bird lice of Ischnocera (family Philopteridae). The novel clade, namely Mitodivisia, is also supported by phylogenetic analysis of mt genome and cox1 gene sequences. Our results demonstrate, for the first time, that organelle genome fragmentation is informative for resolving controversial high-level phylogenies.
  • Phylodynamic Model Adequacy Using Posterior Predictive Simulations
    Abstract: Rapidly evolving pathogens, such as viruses and bacteria, accumulate genetic change on a timescale similar to that of their epidemiological processes, such that it is possible to make inferences about their infectious spread using phylogenetic time-trees. For this purpose it is necessary to choose a phylodynamic model. However, the resulting inferences are contingent on whether the model adequately describes key features of the data. Model adequacy methods allow formal rejection of a model if it cannot generate the main features of the data. We present TreeModelAdequacy, a package for the popular BEAST2 software that allows assessing the adequacy of phylodynamic models. We illustrate its utility by analyzing phylogenetic trees from two viral outbreaks of Ebola and H1N1 influenza. The main features of the Ebola data were adequately described by the coalescent exponential-growth model, whereas the H1N1 influenza data were best described by the birth–death susceptible-infected-recovered model.
  • Allele Phasing Greatly Improves the Phylogenetic Utility of Ultraconserved Elements
    Abstract: Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for nonmodel organisms. Many phylogenetic studies reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences which result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences to analyses performed under the multispecies coalescent model. Our empirical analyses of ultraconserved element locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last 3 myr. The phylogenetic results support the recognition of two species and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River. Our simulations provide evidence that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences.
  • BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics
    Abstract: BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at .
  • Automated Taxonomic Identification of Insects with Expert-Level Accuracy Using Effective Feature Transfer from Convolutional Networks
    Abstract: Rapid and reliable identification of insects is important in many contexts, from the detection of disease vectors and invasive species to the sorting of material from biodiversity inventories. Because of the shortage of adequate expertise, there has long been an interest in developing automated systems for this task. Previous attempts have been based on laborious and complex handcrafted extraction of image features, but in recent years it has been shown that sophisticated convolutional neural networks (CNNs) can learn to extract relevant features automatically, without human intervention. Unfortunately, reaching expert-level accuracy in CNN identifications requires substantial computational power and huge training data sets, which are often not available for taxonomic tasks. This can be addressed using feature transfer: a CNN that has been pretrained on a generic image classification task is exposed to the taxonomic images of interest, and information about its perception of those images is used in training a simpler, dedicated identification system. Here, we develop an effective method of CNN feature transfer, which achieves expert-level accuracy in taxonomic identification of insects with training sets of 100 images or fewer per category, depending on the nature of the data set. Specifically, we extract rich representations of intermediate to high-level image features from the CNN architecture VGG16 pretrained on the ImageNet data set. This information is submitted to a linear support vector machine classifier, which is trained on the target problem. We tested the performance of our approach on two types of challenging taxonomic tasks: 1) identifying insects to higher groups when they are likely to belong to subgroups that have not been seen previously and 2) identifying visually similar species that are difficult to separate even for experts. For the first task, our approach reached 92% accuracy on one data set (884 face images of 11 families of Diptera, all specimens representing unique species), and 96% accuracy on another (2936 dorsal habitus images of 14 families of Coleoptera, over 90% of specimens belonging to unique species). For the second task, our approach outperformed a leading taxonomic expert on one data set (339 images of three species of the Coleoptera genus Oxythyrea; 97% accuracy), and both humans and traditional automated identification systems on another data set (3845 images of nine species of Plecoptera larvae; 98.6% accuracy). Reanalyzing several biological image identification tasks studied in the recent literature, we show that our approach is broadly applicable and provides significant improvements over previous methods, whether based on dedicated CNNs, CNN feature transfer, or more traditional techniques. Thus, our method, which is easy to apply, can be highly successful in developing automated taxonomic identification systems even when training data sets are small and computational budgets limited. We conclude by briefly discussing some promising CNN-based research directions in morphological systematics opened up by the success of these techniques in providing accurate diagnostic tools.
  • Marginal Likelihoods in Phylogenetics: A Review of Methods and Applications
    Abstract: By providing a framework of accounting for the shared ancestry inherent to all life, phylogenetics is becoming the statistical foundation of biology. The importance of model choice continues to grow as phylogenetic models continue to increase in complexity to better capture micro- and macroevolutionary processes. In a Bayesian framework, the marginal likelihood is how data update our prior beliefs about models, which gives us an intuitive measure of comparing model fit that is grounded in probability theory. Given the rapid increase in the number and complexity of phylogenetic models, methods for approximating marginal likelihoods are increasingly important. Here, we try to provide an intuitive description of marginal likelihoods and why they are important in Bayesian model testing. We also categorize and review methods for estimating marginal likelihoods of phylogenetic models, highlighting several recent methods that provide well-behaved estimates. Furthermore, we review some empirical studies that demonstrate how marginal likelihoods can be used to learn about models of evolution from biological data. We discuss promising alternatives that can complement marginal likelihoods for Bayesian model choice, including posterior-predictive methods. Using simulations, we find one alternative method based on approximate-Bayesian computation to be biased. We conclude by discussing the challenges of Bayesian model choice and future directions that promise to improve the approximation of marginal likelihoods and Bayesian phylogenetics as a whole.
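
As a toy illustration of what a marginal likelihood is (the likelihood averaged over the prior), the sketch below uses a coin-flip model rather than a phylogenetic one; the model, the data (9 heads in 10 flips), and all function names are assumptions chosen for this example and do not come from the review. With a Uniform(0, 1) prior on the heads probability, the integral has a closed form (a Beta function), which is compared with a brute-force numerical average and with a point-hypothesis model fixed at 0.5.

```python
import math


def log_marginal_uniform_prior(heads, n):
    """Exact log marginal likelihood of one specific flip sequence with `heads`
    heads in `n` flips, with theta ~ Uniform(0, 1): log B(heads + 1, n - heads + 1)."""
    return math.lgamma(heads + 1) + math.lgamma(n - heads + 1) - math.lgamma(n + 2)


def log_marginal_fixed(heads, n, theta=0.5):
    """Model with no free parameter: the marginal likelihood is just the likelihood."""
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)


def log_marginal_grid(heads, n, steps=100_000):
    """Midpoint-rule average of the likelihood over the uniform prior."""
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps
        total += theta**heads * (1 - theta) ** (n - heads)
    return math.log(total / steps)


exact = log_marginal_uniform_prior(9, 10)
approx = log_marginal_grid(9, 10)
log_bayes_factor = exact - log_marginal_fixed(9, 10)
print(exact, approx, log_bayes_factor)
```

A positive log Bayes factor here means the data favor the free-parameter model over the fixed fair-coin model. In phylogenetics the corresponding integral runs over trees and model parameters, which is why dedicated approximation methods of the kind reviewed in the article are needed.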
  • Phylotocol: Promoting Transparency and Overcoming Bias in Phylogenetics
    Abstract: The integrity of science requires that the process be based on sound experimental design and objective methodology. Strategies that increase reproducibility and transparency in science protect this integrity by reducing conscious and unconscious biases. Given the large number of analysis options and the constant development of new methodologies in phylogenetics, this field is one that would particularly benefit from more transparent research design. Herein, we introduce phylotocol (fi lō ’ta kôl), an a priori protocol-driven approach in which all analyses are planned and documented at the start of a project. The phylotocol template is simple and the implementation options are flexible to reduce administrative burdens and allow researchers to adapt it to their needs without restricting scientific creativity. While the primary goal of phylotocol is to increase transparency and accountability, it has a number of auxiliary benefits including improving study design and reproducibility, enhancing collaboration and education, and increasing the likelihood of project completion. Our goal with this Point of View article is to encourage a dialog about transparency in phylogenetics and the best strategies to bring transparent research practices to our field.
  • Changing Ecological Opportunities Facilitated the Explosive Diversification of New Caledonian Oxera (Lamiaceae)
    Abstract: Phylogenies recurrently demonstrate that oceanic island systems have been home to rapid clade diversification and adaptive radiations. The existence of adaptive radiations posits a central role of natural selection causing ecological divergence and speciation, and some plant radiations have been highlighted as paradigmatic examples of such radiations. However, neutral processes may also drive speciation during clade radiations, with ecological divergence occurring following speciation. Here, we document an exceptionally rapid and unique radiation of Lamiaceae within the New Caledonian biodiversity hotspot. Specifically, we investigated various biological, ecological, and geographical drivers of species diversification within the genus Oxera. We found that Oxera underwent an initial process of rapid cladogenesis likely triggered by a dramatic period of aridity during the early Pliocene. This early diversification of Oxera was associated with an important phase of ecological diversification triggered by significant shifts of pollination syndromes, dispersal modes, and life forms. Finally, recent diversification of Oxera appears to have been further driven by the interplay of allopatry and habitat shifts likely related to climatic oscillations. This suggests that Oxera could be regarded as an adaptive radiation at an early evolutionary stage that has been obscured by more recent joint habitat diversification and neutral geographical processes. Diversification within Oxera has perhaps been triggered by varied ecological and biological drivers acting in a leapfrog pattern, but geographic processes may have been an equally important driver. We suspect that strictly adaptive radiations may be rare in plants and that most events of rapid clade diversification may have involved a mixture of geographical and ecological divergence.
  • Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets
    Abstract: The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
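
The notions of alignment precision and recall, and of "underalignment," can be shown on a tiny example. The sketch below assumes the common pairwise-homology definition (an assumption here, since the abstract does not spell it out): every pair of residues placed in the same column counts as one asserted homology; precision is the fraction of asserted pairs that also appear in the reference, and recall is the fraction of reference pairs that were recovered. The function names and toy alignments are invented for illustration.

```python
def homology_pairs(alignment):
    """Set of residue pairs that share a column.

    Each residue is identified as (sequence index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for s, ch in enumerate(column):
            if ch != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                pairs.add((residues[a], residues[b]))
    return pairs


def precision_recall(estimated, reference):
    """Precision and recall of the estimated alignment's homology pairs.

    Assumes both alignments contain at least one aligned pair.
    """
    est, ref = homology_pairs(estimated), homology_pairs(reference)
    shared = len(est & ref)
    return shared / len(est), shared / len(ref)


reference = ["ACGT", "ACGT"]
underaligned = ["ACGT-", "ACG-T"]  # the final T residues are left unaligned
precision, recall = precision_recall(underaligned, reference)
print(precision, recall)
```

In this toy case every homology the estimate asserts is correct, but one reference homology is missing, so precision stays high while recall drops. That is the pattern the abstract describes as underalignment.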
  • EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences
    Abstract: Next generation sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the evolutionary placement algorithm (EPA) included in RAxML, or PPLACER, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Herein, we present EPA-NG, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and PPLACER. EPA-NG can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-NG, we placed billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3748 taxa in just under h, using 2048 cores. Our performance assessment shows that EPA-NG outperforms RAxML-EPA and PPLACER by up to a factor of in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-NG scales well up to 2048 cores. EPA-NG is available under the AGPLv3 license: .
  • Full Bayesian Comparative Phylogeography from Genomic Data
    • Author: Jamie R Oaks
    • Journal: Systematic Biology
    • 2019, Issue 3
    Abstract: A challenge to understanding biological diversification is accounting for community-scale processes that cause multiple, co-distributed lineages to co-speciate. Such processes predict non-independent, temporally clustered divergences across taxa. Approximate-likelihood Bayesian computation (ABC) approaches to inferring such patterns from comparative genetic data are very sensitive to prior assumptions and often biased toward estimating shared divergences. We introduce a full-likelihood Bayesian approach, ecoevolity, which takes full advantage of information in genomic data. By analytically integrating over gene trees, we are able to directly calculate the likelihood of the population history from genomic data, and efficiently sample the model-averaged posterior via Markov chain Monte Carlo algorithms. Using simulations, we find that the new method is much more accurate and precise at estimating the number and timing of divergence events across pairs of populations than existing approximate-likelihood methods. Our full Bayesian approach also requires several orders of magnitude less computational time than existing ABC approaches. We find that despite assuming unlinked characters (e.g., unlinked single-nucleotide polymorphisms), the new method performs better if this assumption is violated in order to retain the constant characters of whole linked loci. In fact, retaining constant characters allows the new method to robustly estimate the correct number of divergence events with high posterior probability in the face of character-acquisition biases, which commonly plague loci assembled from reduced-representation genomic libraries. We apply our method to genomic data from four pairs of insular populations of Gekko lizards from the Philippines that are not expected to have co-diverged. Despite all four pairs diverging very recently, our method strongly supports that they diverged independently, and these results are robust to very disparate prior assumptions.
