Journal: Nucleic Acids Research

Human pol II promoter prediction: time series descriptors and machine learning


Abstract

Although several in silico promoter prediction methods have been developed to date, their predictive performance remains limited. The limitations stem from two challenges: selecting features that distinguish promoters from non-promoters, and the generalization (predictive) ability of the machine-learning algorithms. In this paper we propose a novel approach that uses unique descriptors and machine-learning methods to recognize eukaryotic polymerase II promoters. Non-linear time series descriptors, together with non-linear machine-learning algorithms such as the support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea is to use descriptors that do not depend on the primary DNA sequence and that provide a clear distinction between promoter and non-promoter regions. The classification model, built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87%, and on an independent test set the accuracy was above 85% for both promoter and non-promoter identification. The approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies, together with non-linear time series descriptors such as Lyapunov component stability and Tsallis entropy and supervised machine-learning methods such as SVMs, can be useful in identifying pol II promoters.
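To make two of the descriptors named in the abstract concrete, the following is a minimal sketch of n-mer frequencies and Tsallis entropy for a DNA sequence. It is not the authors' implementation: the abstract does not give their descriptor definitions or parameters, so the function names, the overlapping-window counting, the choice k = 3, and the entropic index q = 2 are all assumptions made for illustration. The Tsallis entropy uses the standard definition S_q = (1 - Σ p_i^q) / (q - 1).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Relative frequency of every overlapping k-mer over the alphabet ACGT.

    Assumption: overlapping windows, ambiguous bases (e.g. N) skipped.
    """
    seq = seq.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if set(w) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    # Emit all 4**k k-mers in a fixed order so vectors are comparable.
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product("ACGT", repeat=k)
    }


def tsallis_entropy(probabilities, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1), defined for q != 1."""
    if q == 1.0:
        raise ValueError("q = 1 is the Shannon limit; choose another q")
    return (1.0 - sum(p ** q for p in probabilities if p > 0)) / (q - 1.0)


def describe(seq, k=3, q=2.0):
    """Hypothetical feature vector: 4**k k-mer frequencies plus one entropy value."""
    freqs = kmer_frequencies(seq, k)
    return list(freqs.values()) + [tsallis_entropy(freqs.values(), q)]
```

Vectors of this kind could then be passed to an SVM implementation (for instance scikit-learn's `SVC`) and scored with 10-fold cross-validation, although the abstract does not say which implementation or kernel was used.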
