Much of the complexity of biology lies in the problem of how cells sharing the same, relatively static genetic code produce the vast diversity of cellular states and functions necessary for multicellular life. DNA and RNA are the storage and message molecules, respectively, of genetic information transfer, but these macromolecules have also been co-opted as direct functional players in the control of specific, context-dependent gene regulation. The vastness of sequence space and the challenge of generating quantitative, genome-scale datasets make understanding, predicting, and intervening in these modes of gene regulation daunting.
The first part of this work addresses Argonaute family proteins, which load short nucleic acid guides to program specific binding to nucleic acid targets to regulate gene expression, host defense, and other biological functions. We deploy multiple high-throughput sequencing-based assays to measure the association rates, binding affinities, and single-turnover cleavage rates for mouse Ago2 loaded with specific RNA guides against >40,000 unique RNA targets. We map sequence-to-structure-to-function relationships for Ago2 binding and cleavage, and show that our in vitro measurements can be used to predict gene repression in an engineered cellular system. We next use similar methodological approaches to study TtAgo, an Argonaute protein derived from the bacterium Thermus thermophilus that uses DNA guides to bind and cleave DNA targets at extreme temperatures. By measuring the binding of multiple DNA guides against thousands of targets each, we were able to construct general, quantitative models of association kinetics and binding affinity. We also show that guide sequence composition has dramatic effects on cleavage activity, suggesting that only a subset of guides is capable of cleaving targets at physiologically relevant temperatures.
In the second part of this work, we examine a different form of gene regulation: the use of enhancers and other cis-regulatory elements to control gene expression in the many distinct cell types comprising human scalp. We generated paired single-cell RNA- and ATAC-sequencing datasets of primary human scalp. We use these integrated datasets to identify 'highly-regulated genes' linked to a disproportionately large number of enhancers and show that for a given highly-regulated gene expressed in multiple cell types, a greater number of linked enhancers is associated with higher levels of transcription. We demonstrate that genetic variation associated with skin and hair disease is specifically enriched in open chromatin regions of implicated cell types, including a strong association between dermal papilla cells and androgenetic alopecia. Using machine learning approaches, we further prioritize specific genetic variants that putatively disrupt transcription factor binding sites, leading to altered expression at disease-relevant genes.