Journal of Integrative Bioinformatics

Probabilistic Latent Semantic Analysis Applied to Whole Bacterial Genomes Identifies Common Genomic Features


Abstract

The spread of drug resistance amongst clinically important bacteria is a serious and growing problem [1]. However, the analysis of entire genomes requires considerable computational effort, usually including the assembly of the genome and subsequent identification of genes known to be important in pathology. An alternative approach is to use computational algorithms to identify genomic differences between pathogenic and non-pathogenic bacteria, even without knowing the biological meaning of those differences. Because such comparisons involve high-dimensional data, a range of techniques for dimensionality reduction have been developed. One such approach is the family of latent-variable models [2]. In latent-variable models, dimensionality reduction is achieved by representing high-dimensional data by a few hidden, or latent, variables, which are not directly observed but are inferred from the observed variables present in the model. Probabilistic Latent Semantic Analysis (PLSA), also called Probabilistic Latent Semantic Indexing, is an extension of Latent Semantic Analysis (LSA) [3]. PLSA is based on a mixture decomposition derived from a latent class model. The main objective of the algorithm, as in LSA, is to represent high-dimensional co-occurrence information in a lower-dimensional form in order to discover the hidden semantic structure of the data within a probabilistic framework. In this work we applied the PLSA approach to analyse the common genomic features of methicillin-resistant Staphylococcus aureus, using tokens derived from amino acid sequences rather than DNA. We characterised genome-scale amino acid sequences in terms of their components, and then investigated the relationships between genomes and tokens and the phenotypes they generated. As a control we used the non-pathogenic model Gram-positive bacterium Bacillus subtilis.
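To make the mixture decomposition concrete, the following is a minimal sketch of PLSA fitted by expectation-maximisation on a genome-by-token count matrix. It is not the authors' implementation: the abstract does not say how tokens were derived, so the overlapping 3-mer tokenisation is an assumption, and the function names (`kmer_tokens`, `count_matrix`, `plsa`) and the toy sequences are hypothetical. The model fitted is P(w|d) = sum over z of P(w|z) P(z|d), where d is a genome, w a token and z a latent class.

```python
import random
from collections import Counter


def kmer_tokens(protein, k=3):
    """Split an amino acid sequence into overlapping k-mers (assumed tokenisation)."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def count_matrix(proteomes, k=3):
    """Build a genomes-by-tokens count matrix; each genome is a list of protein strings."""
    counts = [Counter(t for seq in seqs for t in kmer_tokens(seq, k))
              for seqs in proteomes]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts], vocab


def normalise(row):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(n, n_topics=2, n_iter=50, seed=0):
    """Estimate P(z|d) and P(w|z) by EM from the count matrix n[d][w]."""
    rng = random.Random(seed)
    n_docs, n_words = len(n), len(n[0])
    # Random positive initialisation of both conditional distributions.
    p_z_d = [normalise([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalise([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if n[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w), proportional to P(z|d) P(w|z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                s = sum(post)
                if s == 0:
                    continue
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    r = n[d][w] * post[z] / s
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        # M-step: renormalise expected counts into probabilities.
        p_z_d = [normalise(row) for row in new_zd]
        p_w_z = [normalise(row) for row in new_wz]
    return p_z_d, p_w_z


# Toy input: three made-up "genomes", each a list of short protein strings.
toy_proteomes = [["MKKLLAMKKLLA"], ["MKKLLAGGSGGS"], ["GGSGGSGGSGGS"]]
toy_counts, toy_vocab = count_matrix(toy_proteomes, k=3)
toy_p_z_d, toy_p_w_z = plsa(toy_counts, n_topics=2)
```

Each row of `toy_p_z_d` is one genome's mixture over latent classes, and each row of `toy_p_w_z` is one latent class's distribution over tokens. Real proteomes would need a sparse representation and a library such as NumPy; plain lists are used here only to keep the sketch self-contained.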
