Latent Variable Models of Sequence Data for Classification and Discovery

Abstract

The need to operate on sequence data is prevalent across a range of real-world applications, including protein/DNA classification, speech recognition, intrusion detection, and text classification. Sequence data can be distinguished from the more typical vector representation in that the length of sequences within a dataset can vary and the order of symbols within a sequence carries meaning. Although it has become increasingly easy to collect large amounts of sequence data, our ability to infer useful information from these sequences has not kept pace. For instance, in the domain of biological sequences, experimentally determining the order of amino acids in a protein is far easier than determining the protein's physical structure or its role within a living organism. This asymmetry holds across a number of sequence-data domains, and, as a result, researchers increasingly rely on computational techniques to infer properties of sequences that are difficult or costly to obtain through direct measurement. The methods I describe in this dissertation attempt to mitigate this asymmetry by advancing state-of-the-art techniques for extracting useful information from sequence data.

The first model I discuss in this thesis combines two types of statistical models, topic models and the Hidden Markov Model, in a novel way. Topic models, like Latent Dirichlet Allocation, make the simplifying assumption that words in a document are generated independently, while Hidden Markov Models assume a pairwise dependency between adjacent elements of a sequence. Our Hidden Markov Model Variant adds the pairwise dependency assumption back into the topic-modeling structure. This structural change allows the HMM Variant to extract fixed-length representations of variable-length sequences by accumulating statistics from the latent portions of the model. These fixed-length representations can then be used as input to any standard machine learning algorithm that requires fixed-length vector inputs. We show that, in conjunction with a support vector machine classifier, these representations perform well for classifying protein sequences.
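The abstract does not detail the accumulation step, but the underlying recipe is standard. The sketch below is a minimal illustration under simplifying assumptions: a plain discrete-emission HMM with known parameters stands in for the topic-model-structured variant, and the fixed-length representation is the time-averaged posterior state occupancy computed by the forward-backward algorithm.

```python
import numpy as np

def state_occupancy_features(obs, pi, A, B):
    """Fixed-length features for a variable-length symbol sequence.

    A minimal sketch, not the thesis's HMM Variant: run forward-backward
    on a plain discrete-emission HMM with known parameters and average
    the posterior state occupancies over time.

    obs : (T,) int array of symbol indices
    pi  : (K,) initial state distribution
    A   : (K, K) transitions, A[i, j] = P(z_t = j | z_{t-1} = i)
    B   : (K, V) emissions,   B[k, v] = P(x_t = v | z_t = k)
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))  # scaled forward messages
    beta = np.ones((T, K))    # scaled backward messages
    scale = np.zeros(T)       # per-step normalizers (Rabiner scaling)

    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta                       # posterior P(z_t = k | obs)
    gamma /= gamma.sum(axis=1, keepdims=True)  # per-step normalization
    return gamma.mean(axis=0)  # one feature per hidden state
```

Each sequence, whatever its length T, maps to a K-dimensional vector with one entry per hidden state, which can then be fed to any fixed-input classifier such as scikit-learn's SVC.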
The second model discussed in this thesis is an extension of the Profile HMM, a version of the Hidden Markov Model commonly used to represent biological sequences. Our Infinite Profile HMM modifies the basic Profile HMM to allow an infinite number of hidden states. To run inference over this infinite set of hidden states, we introduce a transformation of the model's hidden state space. This transformation allows us to compute an approximate marginal probability using only a finite amount of space by pruning low-probability configurations from the joint distribution. Our inference method not only enables inference for the infinite model but also significantly increases the speed of inference in the standard Profile HMM.

This thesis also covers methods to combine structure from multiple Profile HMMs. To accomplish this task, we first simplify the Profile HMM into a model that we call the Simplified Local Profile HMM (SL-pHMM). Two separate strategies can be used to combine multiple SL-pHMMs into a unified probabilistic model over sequences. The first strategy uses a separate "switching variable" for each element of a sequence; this switching variable selects which individual SL-pHMM generates the associated sequence element. The second strategy, which we call the Factorial SL-pHMM, constructs probability distributions over individual sequence elements using a linear combination of the SL-pHMM hidden states. Either strategy can then be further combined with a distribution over sequence labels, allowing the model to generate both the sequence elements and the sequence label. We show that both strategies are effective for classifying synthetically generated sets of sequences.

An extension of the Factorial SL-pHMM involves relaxing the hidden state space of the SL-pHMM to a continuous domain. If we place a sparsity-encouraging regularizer on this new continuous space, the resulting model shares many characteristics with a set of techniques frequently used in computer vision known as Sparse Dictionary Learning. This relaxation is the basis of our Relevant Subsequence Sparse Dictionary Learning (RS-DL) model. Applied to continuous sequences, RS-DL is effective at extracting human-recognizable motifs. In addition, subsequences extracted using RS-DL can improve classification performance over standard nearest-neighbor and dynamic time warping techniques.
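The abstract does not state the RS-DL objective itself. For orientation only, a generic shift-invariant sparse dictionary learning problem of the kind it references, written here under assumed notation, takes the form

$$\min_{\{d_k\},\,\{\alpha^{(n)}\}} \; \sum_{n} \Big\| x^{(n)} - \sum_{k=1}^{K} \sum_{t} \alpha^{(n)}_{k,t}\, \mathcal{S}_t(d_k) \Big\|_2^2 \;+\; \lambda \sum_{n,k,t} \big| \alpha^{(n)}_{k,t} \big|,$$

where each dictionary element $d_k$ is a short motif, the operator $\mathcal{S}_t$ places $d_k$ at offset $t$ of an otherwise zero sequence, and the $\ell_1$ penalty pushes each sequence $x^{(n)}$ to be reconstructed from only a few motif occurrences, which matches the motif-extraction behavior described above.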
The final contributions of this work involve incorporating Hidden Markov Model structure into a family of purely discriminative models. We call these models Subsequence Networks; they operate by incorporating Profile HMM and Pair HMM structure into the lower layers of a neural network. This structure is similar to convolutional neural networks, which have produced state-of-the-art results on a number of tasks in computer vision. Subsequence Networks are competitive with state-of-the-art sequence kernel methods for protein sequence classification but use a significantly different mode of operation. (Abstract shortened by UMI.)
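To make the convolutional analogy concrete, here is a small PyTorch sketch. It is a hypothetical stand-in rather than a Subsequence Network: where the thesis places Profile HMM and Pair HMM structure at the lower level, this sketch uses an ordinary 1-D convolution over one-hot encoded residues, followed by max pooling over positions and a linear classifier.

```python
import torch
import torch.nn as nn

class ConvSequenceClassifier(nn.Module):
    """Hypothetical convolutional baseline illustrating the analogy:
    the convolution scores a bank of learned motifs at every offset
    (the role HMM structure plays in a Subsequence Network), and max
    pooling keeps each motif's best match anywhere in the sequence."""

    def __init__(self, n_symbols=20, n_motifs=64, motif_len=7, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(n_symbols, n_motifs, kernel_size=motif_len)
        self.classify = nn.Linear(n_motifs, n_classes)

    def forward(self, x):
        # x: (batch, n_symbols, seq_len), one-hot encoded residues
        scores = torch.relu(self.conv(x))   # motif scores at each offset
        pooled = scores.max(dim=2).values   # best score per motif
        return self.classify(pooled)        # class logits
```

Max pooling over positions is what makes the output length-independent, the same requirement that the fixed-length representations of the first model address.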

Record Details

  • Author: Blasiak, Samuel J.
  • Author Affiliation: George Mason University.
  • Degree Grantor: George Mason University.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 210 p.
  • Total Pages: 210
  • Format: PDF
  • Language: English
