
Contrast Learning on ChIP-Seq Data of Transcription Factors.

Abstract

In this study, we analyzed transcription factor (TF) ChIP-Seq data for 105 pairs of TFs (i.e., 15 choose 2). Each pair consists of two TFs and three binding-dependent (BD) sequence datasets generated from the two TF ChIP-Seq datasets in that pair. The three BD datasets contain TF binding site (TFBS) sequences bound by TF 1, by TF 2, or by both.

The objective is to identify motif 1, motif 2, or both (i.e., interactive motifs) by contrasting two of the three BD datasets at a time with the contrast-motif-finder (CMF) algorithm. Each CMF output provides estimated consensus motifs based on position weight matrices (PWMs), together with likelihood ratios (LRs) that measure the enrichment of each identified motif. Using this output, we construct a dataset in which the first column lists the genomic locations of the identified enriched motifs, columns 2 to n+1 contain the estimated consensus motifs, where n is the number of consensus motifs, and the last column is a binary (0/1) indicator of which of the two contrasted sets each row comes from.

Once these datasets are obtained, we fit statistical models, namely logistic regression, support vector machines (SVMs), and classification trees, and compare their performance (i.e., error rates) and selection power. We show that the SVM with a radial kernel appears to perform best when all motifs in the dataset are used, whereas the classification tree selects the fewest motifs in almost every analyzed dataset without much deterioration in error rate or selection power. As a result, we believe the classification tree is the better model: it offers competitive predictive power with simpler models and takes far less computation time than the other two models.
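
The design described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's code: the TF names (`TF1` to `TF15`), the dataset labels, the helper names `bd_datasets`, `contrasts` and `make_row`, the example location string, and the use of 0/1 values in the motif columns are all hypothetical, since the abstract names none of them. It also treats each contrast as an unordered pair of BD datasets, which the abstract does not specify.

```python
from itertools import combinations

# Hypothetical placeholder names; the abstract does not list the 15 TFs.
tfs = [f"TF{i}" for i in range(1, 16)]

# Every unordered pair of TFs: C(15, 2) = 105 pairs.
pairs = list(combinations(tfs, 2))


def bd_datasets(tf_a, tf_b):
    """Labels for the three binding-dependent (BD) datasets of one pair:
    sequences bound by the first TF, by the second TF, or by both."""
    return (f"{tf_a}", f"{tf_b}", f"{tf_a}+{tf_b}")


def contrasts(tf_a, tf_b):
    """All ways to contrast two of the three BD datasets at a time.
    Assumption: contrasts are unordered, giving three per TF pair."""
    return list(combinations(bd_datasets(tf_a, tf_b), 2))


def make_row(location, motif_values, label):
    """One row of the table described in the abstract: column 1 is the
    location, columns 2..n+1 hold one value per consensus motif, and the
    last column is the 0/1 indicator of the originating set.
    Assumption: the cell encoding of the motif columns is a guess."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return [location, *motif_values, label]


if __name__ == "__main__":
    print(len(pairs))              # number of TF pairs
    print(contrasts(*pairs[0]))    # the three contrasts for the first pair
    print(make_row("chr1:1000-1020", [1, 0, 1], 1))
```

With n consensus motifs, each row has n + 2 entries, matching the column layout in the abstract. The table for one contrast could then be passed to any of the three classifiers the abstract compares.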

Bibliographic record

  • Author

    Lee, Yuju

  • Author affiliation

    University of California, Los Angeles

  • Degree-granting institution: University of California, Los Angeles
  • Subjects: Bioinformatics; Mathematics; Computer science
  • Degree: M.S.
  • Year: 2014
  • Pages: 73 p.
  • Total pages: 73
  • Original format: PDF
  • Language of text: English