Inferring Disease-Associated MicroRNAs Using Semi-supervised Multi-Label Graph Convolutional Networks

Abstract

Introduction

MicroRNAs (miRNAs) are small non-coding RNAs of about 22 nucleotides that interact with other RNAs and play important roles in transcriptional and post-transcriptional gene regulation. It is estimated that over 60% of all human protein-coding genes (PCGs) are regulated by miRNAs, and these miRNAs have been implicated in diseases. The associations between diseases and PCGs are well investigated; many disease-PCG associations have been discovered and collected in public databases such as DISEASES, OUGene, and DisGeNET. The roles of PCGs in diseases are well established, whereas the effects of miRNAs are only now receiving increasing study. As high-throughput sequencing generates more data, more and more miRNAs are being discovered, and identifying their functions experimentally is costly and time consuming. It is therefore imperative to develop computational methods that identify functional miRNA biomarkers associated with diseases, especially by using the rich information buried in disease-associated PCGs.

Some miRNAs are expressed mainly in certain tissues and show tissue specificity, with tissue-specific expression patterns that are associated with diseases. They are expected to behave similarly to other disease-associated genes such as PCGs or long non-coding RNAs (lncRNAs). Accordingly, several existing computational methods have used tissue expression data to infer gene-disease associations. For instance, GeneTIER uses disease-tissue associations to prioritize disease candidate genes. NetWAS identifies disease-associated genes by combining tissue-specific interaction networks with genome-wide association studies. In particular, some methods combine tissue expression profiles with machine learning models to infer disease-associated lncRNAs.
For example, DislncRF trains machine learning models on tissue expression profiles of disease-associated PCGs and then applies the trained models to infer disease-associated lncRNAs. All of the above studies demonstrate that tissue expression profiles can facilitate the detection of disease-gene associations.

On the other hand, interaction networks contain rich clues for linking miRNAs to diseases. Many computational methods have been developed in the context of gene-gene networks. For example, Jiang et al. integrate miRNA and disease similarity networks with miRNA-disease associations to prioritize disease candidate miRNAs using a network-based approach; midp applies a random walk on the interaction network to infer disease-associated miRNAs. Similarly, RWRMDA implements a random walk on the miRNA functional similarity network to link miRNAs to diseases; MDHGI integrates the predicted association score based on a sparse learning method to infer disease-associated miRNAs. More closely related studies are as follows: DRMDA applies a stacked autoencoder to learn deep representations for predicting miRNA-disease associations (Chen et al., 2018a), LRSSLMDA and PBMDA use Laplacian regularized sparse subspace learning and a path-based computational model, respectively, for miRNA-disease association prediction (Chen and Huang, 2017, You et al., 2017), BNPMDA uses bipartite network projection based on the known miRNA-disease associations (Chen et al., 2018b), and KBMF-MDI employs kernelized Bayesian matrix factorization to score miRNA-disease associations by integrating disease and miRNA similarity (Lan et al., 2018).
Similarly, DLRMC infers disease-associated miRNAs using dual Laplacian regularized matrix completion (Tang et al., 2019). FamCluRank applies non-negative matrix factorization on a heterogeneous network with node attributes to predict disease-associated miRNAs (Xuan et al., 2018b). Deep learning in particular has been used to extract deep representations for disease-miRNA association prediction (Xuan et al., 2018a).

The above methods share a common hypothesis: similar miRNAs can be associated with the same disease, and similar diseases would be associated with the same miRNA. They therefore typically train and evaluate models that take representations of miRNAs and diseases as inputs, using verified disease-miRNA associations and cross-validation.

However, as pointed out by Lehtinen et al. (2015) in the context of gene function prediction, cross-validation may be problematic because some gene-function associations in the benchmark set are not independent. The same issue exists for disease-miRNA associations, for three reasons: (1) miRNAs from the same family may be associated with the same disease, (2) disease-associated miRNAs from miRNA-target assays may be derived from the targets that these miRNAs interact with, and (3) the miRNAs associated with child diseases are related to the miRNAs of parent diseases in the disease ontology. When models are trained and evaluated using cross-validation, randomly dividing the disease-miRNA associations may separate dependent associations into the training and test sets, potentially leading to an overestimated predictive performance.
Cross-validation performance on miRNA-disease associations may therefore reflect not the method's ability to predict new miRNA-disease associations, but rather how information is spread across the benchmark set. In addition, as reported by Park and Marcotte (2012), cross-validation can be flawed for computational pair-input prediction. One disease may be associated with multiple miRNAs, and one miRNA with multiple diseases, so randomly dividing disease-miRNA pairs into training and test sets makes some pairs in the test set share either the miRNA or the disease with pairs in the training set. As a result, the measured performance may not carry over to unseen disease-miRNA associations.

Thus, during cross-validation, complicated steps are required to ensure that dependent samples fall together in either the training set or the test set, and that pairs in the training and test sets do not share an miRNA or a disease. It is almost impossible to construct a completely independent test set. An alternative strategy is to avoid using disease-miRNA associations for model training altogether. For instance, instead of using miRNA-disease associations, the miRPD approach combines PCG-disease associations and an miRNA-PCG network to score miRNAs and diseases (Mork et al., 2014). This motivated us to further investigate disease-miRNA associations based on an interaction network.
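The pair-input issue described above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a procedure taken from the paper or from Park and Marcotte (2012): the function names, the held-out sets, and the toy associations are all hypothetical. It keeps a test pair only when neither its miRNA nor its disease appears in training, and it does not address the other dependencies mentioned in the text (miRNA families, shared targets, disease ontology).

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (miRNA, disease)


def component_disjoint_split(
    pairs: Iterable[Pair],
    held_out_mirnas: Set[str],
    held_out_diseases: Set[str],
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split pairs so that train and test share neither an miRNA nor a disease.

    Pairs with exactly one held-out component cannot go to either side
    without leaking that component, so they are returned separately.
    """
    train, test, dropped = [], [], []
    for mirna, disease in pairs:
        m_out = mirna in held_out_mirnas
        d_out = disease in held_out_diseases
        if m_out and d_out:
            test.append((mirna, disease))
        elif not m_out and not d_out:
            train.append((mirna, disease))
        else:
            dropped.append((mirna, disease))
    return train, test, dropped


def shares_component(train: Iterable[Pair], test: Iterable[Pair]) -> bool:
    """Return True if any test pair reuses an miRNA or disease seen in training."""
    train = list(train)
    seen_mirnas = {m for m, _ in train}
    seen_diseases = {d for _, d in train}
    return any(m in seen_mirnas or d in seen_diseases for m, d in test)


# Toy, made-up associations used purely for illustration.
toy_pairs = [
    ("miR-A", "disease-1"),
    ("miR-A", "disease-2"),
    ("miR-B", "disease-1"),
    ("miR-C", "disease-3"),
]
train, test, dropped = component_disjoint_split(
    toy_pairs, held_out_mirnas={"miR-C"}, held_out_diseases={"disease-3"}
)
```

A naive split of the same toy list, such as the first two pairs for training and the last two for testing, leaves the test pair ("miR-B", "disease-1") sharing a disease with training, which is the kind of overlap `shares_component` is meant to flag.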
Because many high-confidence disease-PCG associations exist and an miRNA may share a disease with its PCG targets, PCG-associated diseases can be transferred to miRNAs over an interaction network under a new semi-supervised framework.

Recently, deep learning has achieved remarkable results in computational biology (Angermueller et al., 2016, Ching et al., 2018), especially convolutional neural networks (CNNs) (Lecun et al., 1998). CNNs can capture local correlations in data and consist mainly of convolutional layers, pooling layers, and fully connected layers. Many studies have demonstrated that CNNs are powerful at learning hidden patterns from complicated biological data. For example, DeepBind (Alipanahi et al., 2015) and DeepSEA (Zhou and Troyanskaya, 2015) apply CNNs to predict the binding preferences of DNA/RNA-binding proteins and the impact of non-coding variants, respectively. iDeep (Pan and Shen, 2017) and iDeepE (Pan and Shen, 2018) further improve the prediction of RNA-binding protein (RBP) binding sites and motifs using hybrid CNNs. iDeepS (Pan et al., 2018) identifies the binding sequence and structure preferences of RBPs simultaneously using CNNs and a long short-term memory network.

Although CNNs have shown their power, they cannot directly handle graph-structured datasets such as gene-gene networks.
To analyze such network data, graph convolutional networks (GCNs) have been developed (Defferrard et al., 2016, Hamilton et al., 2017, Kipf and Welling, 2017). Under the framework of spectral graph convolutions, a GCN encodes both the local graph structure and the features of nodes. GCNs have been applied to graph data to predict polypharmacy side effects, where the graph is a multimodal graph constructed from protein-protein interactions, drug-protein interactions, and polypharmacy side effects (Zitnik et al., 2018). The GCN is a graph-based semi-supervised learning method that does not require labels for all nodes. This setting is especially well suited to inferring miRNA-associated diseases, because the disease associations of many miRNAs are not well investigated, whereas many disease-PCG associations are available. Compared with traditional semi-supervised methods (Jia et al., 2016, Wan and Wang, 2019, Zhang et al., 2018, Zoidi et al., 2018), GCNs can capture the structural information within a node's local network, much as CNNs do in images. In addition, one PCG or miRNA can be associated with multiple diseases. Thus, we can formulate the prediction of disease-miRNA associations as a multi-label classification problem.

In this study, we present a new semi-supervised multi-label learning method, DimiG, based on GCNs, which integrates multiple networks of PCG-PCG interactions, PCG-miRNA interactions, and PCG-disease associations together with tissue expression profiles to infer miRNA-associated diseases.
DimiG does not require disease-miRNA associations; it is trained on a graph consisting of PCG-PCG and miRNA-PCG interactions, in which only PCGs carry disease labels. DimiG is then used to score associations between diseases and miRNAs.

This study makes four major contributions to understanding disease-miRNA associations.
(1) We further demonstrate that the cross-validation performance of methods trained on known disease-miRNA associations could be overestimated and may not reflect a method's actual ability to predict new disease-miRNA associations. We propose a network-based knowledge transfer approach for this problem: because an miRNA may share a disease with its PCG targets and many high-confidence disease-PCG associations exist, PCG-associated diseases can be transferred to miRNAs within an interaction network framework.
(2) We formulate disease-miRNA association prediction as semi-supervised multi-label node classification in a graph, which helps in learning from complex networks composed of unlabeled miRNAs and labeled PCGs with multi-label associations. This is a new prediction protocol for this problem.
(3) We use a semi-supervised GCN to learn patterns from PCG-associated diseases on an interaction network, and these patterns are then used to score diseases and miRNAs. This GCN-based approach combines the advantages of deep learning for representation learning with those of network-based methods.
(4) We incorporate domain knowledge into model construction. Because miRNAs are often expressed in a tissue-specific way, we integrate expression profiles across tissues into our GCN framework. Our results demonstrate that informative signals in more tissues can be captured to aid the inference of disease-associated miRNAs.
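The semi-supervised multi-label setup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the DimiG implementation: it assumes the renormalized propagation rule of Kipf and Welling (2017), a two-layer network, a sigmoid output per disease label, and a binary cross-entropy loss computed only on labeled PCG nodes. The toy graph, feature sizes, labels, and random (untrained) weights are all made up for the example.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # >= 1 thanks to the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(a_hat, x, w0, w1):
    """Two-layer GCN with a sigmoid output: one score per node and label."""
    hidden = np.maximum(a_hat @ x @ w0, 0.0)  # ReLU
    logits = a_hat @ hidden @ w1
    return 1.0 / (1.0 + np.exp(-logits))


def masked_bce(probs, labels, labeled_mask):
    """Binary cross-entropy averaged over labeled nodes only."""
    eps = 1e-9
    p = probs[labeled_mask]
    y = labels[labeled_mask]
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))


rng = np.random.default_rng(0)

# Hypothetical toy graph: nodes 0-2 are PCGs, nodes 3-4 are miRNAs.
adj = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
], dtype=float)
x = rng.random((5, 4))        # 4 made-up tissue expression features per node
labels = np.zeros((5, 3))     # 3 made-up diseases; miRNA rows stay unlabeled
labels[0, 0] = labels[1, 0] = labels[1, 2] = labels[2, 1] = 1.0
is_pcg = np.array([True, True, True, False, False])

w0 = rng.normal(scale=0.5, size=(4, 8))   # untrained weights, illustration only
w1 = rng.normal(scale=0.5, size=(8, 3))

a_hat = normalize_adjacency(adj)
scores = gcn_forward(a_hat, x, w0, w1)
loss = masked_bce(scores, labels, is_pcg)  # supervision comes from PCG nodes only
mirna_scores = scores[~is_pcg]             # disease scores for the unlabeled miRNAs
```

Because the loss is masked to PCG nodes, no disease-miRNA association is needed during training, yet every miRNA still receives a score for each disease through the shared graph propagation. In practice the weights would be fit by gradient descent in a deep learning framework; that step is omitted here.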

Bibliographic record

  • Journal: iScience
  • Authors: Xiaoyong Pan; Hong-Bin Shen
  • Year (volume), issue: 2019 (20), -1
  • Year: 2019
  • Pages: 265–277
  • Total pages: 29
  • Original format: PDF
