Computational and Structural Biotechnology Journal

Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions

Abstract

Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Identifying interacting lncRNA-protein pairs is therefore the first step in understanding the function and mechanism of lncRNAs. Because determining lncRNA-protein interactions by high-throughput experiments is time-consuming and expensive, more robust and accurate computational methods need to be developed. In this study, we developed a new method for predicting potential lncRNA-protein interactions, named LPI-Pred, which is based on learning distributed representations of sequences and is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segments, which can be regarded as "words" in natural language processing. We then trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to mine distributed representations of RNA and protein. Next, the dimension of the complex features was reduced by feature selection based on the Gini information impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
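
To illustrate the k-mer segmentation step mentioned in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the function name `kmer_tokens`, the values of `k`, the overlapping stride of 1, and the example sequences are all assumptions chosen for illustration, since the abstract does not state them.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a biological sequence into k-mer "words".

    Hypothetical helper: k and stride are illustrative assumptions,
    not values reported in the abstract.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.strip().upper()
    # Slide a window of width k along the sequence; sequences shorter
    # than k yield an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Made-up toy sequences, not taken from any benchmark dataset.
rna_words = kmer_tokens("AUGGCUA", k=4)
protein_words = kmer_tokens("MKVLAT", k=3)
print(rna_words)      # ['AUGG', 'UGGC', 'GGCU', 'GCUA']
print(protein_words)  # ['MKV', 'KVL', 'VLA', 'LAT']
```

Each sequence becomes a list of tokens, analogous to a sentence made of words. A corpus of such token lists is the kind of input that word2vec implementations typically accept, which is presumably how the RNA2vec and Pro2vec models described above were trained.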
