The need to operate on sequence data is prevalent across a range of real-world applications, including protein/DNA classification, speech recognition, intrusion detection, and text classification. Sequence data differ from the more typical vector representation in that the lengths of sequences within a dataset can vary and the order of symbols within a sequence carries meaning. Although it has become increasingly easy to collect large amounts of sequence data, our ability to infer useful information from these sequences has not kept pace. For instance, in the domain of biological sequences, experimentally determining the order of amino acids in a protein is far easier than determining the protein's physical structure or its role within a living organism. This asymmetry holds across a number of sequence data domains, and, as a result, researchers increasingly rely on computational techniques to infer properties of sequences that are difficult or costly to measure directly. The methods I describe in this dissertation attempt to mitigate this asymmetry by advancing state-of-the-art techniques for extracting useful information from sequence data.

The first model I discuss in this thesis combines two types of statistical models, topic models and the Hidden Markov Model (HMM), in a novel way. Topic models such as Latent Dirichlet Allocation make the simplifying assumption that the words in a document are generated independently, while Hidden Markov Models assume a pairwise dependency between adjacent elements of a sequence. Our HMM Variant adds this pairwise dependency back into the topic-modeling structure. The change allows the HMM Variant to extract fixed-length representations of variable-length sequences by accumulating statistics from the latent portions of the model. These fixed-length representations can then serve as input to any standard machine learning algorithm that requires fixed-length vectors. We show that these representations, used in conjunction with a support vector machine classifier, perform well for classifying protein sequences.
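To make the last step concrete, the sketch below collapses per-position posterior distributions over K latent states (as produced, for example, by the forward-backward algorithm) into a single length-K vector by averaging. This is a minimal, hypothetical instance of "accumulating statistics from the latent portions of the model"; the statistics actually accumulated in the thesis may be richer.

```python
import numpy as np

def fixed_length_features(posteriors):
    """Collapse per-position latent-state posteriors (a T x K array
    whose rows sum to 1) into one length-K vector by averaging over
    positions. Averaging, rather than summing, keeps the features
    comparable across sequences of different lengths T."""
    return np.asarray(posteriors).mean(axis=0)

# Hypothetical usage: three sequences of different lengths, each with
# posteriors over K = 4 latent states (random stand-ins here).
rng = np.random.default_rng(0)
for T in (12, 30, 7):
    posteriors = rng.dirichlet(np.ones(4), size=T)
    print(T, fixed_length_features(posteriors).shape)  # always (4,)
```

Because every sequence maps to the same K-dimensional space regardless of its length, the resulting vectors can be fed directly to a support vector machine or any other fixed-input classifier.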
The second model discussed in this thesis is an extension of the Profile HMM, a version of the Hidden Markov Model commonly used to represent biological sequences. Our Infinite Profile HMM modifies the basic Profile HMM to allow an infinite number of hidden states. To perform inference over this infinite set of hidden states, we introduce a transformation of the model's hidden state space. This transformation allows us to compute an approximate marginal probability using only a finite amount of space by pruning low-probability configurations from the joint distribution. Our inference method not only makes inference in the infinite model tractable but also significantly increases the speed of inference in the standard Profile HMM.

This thesis also covers methods for combining structure from multiple Profile HMMs. To accomplish this task, we first simplify the Profile HMM into a model that we call the Simplified Local Profile HMM (SL-pHMM). Two separate strategies can be used to combine multiple SL-pHMMs into a unified probabilistic model over sequences. The first strategy uses a separate "switching variable" for each element of a sequence; this switching variable selects which individual SL-pHMM generates the associated sequence element. The second strategy, which we call the Factorial SL-pHMM, constructs probability distributions over individual sequence elements using a linear combination of the SL-pHMM hidden states. These strategies can be further combined with a distribution over sequence labels, allowing the model to generate both the sequence elements and the sequence label. We show that both strategies are effective for classifying synthetically generated sets of sequences.

An extension of the Factorial SL-pHMM relaxes the hidden state space of the SL-pHMM to a continuous domain. If we place a regularizer that encourages sparsity on this new continuous space, the resulting model shares many characteristics with a family of techniques frequently used in computer vision known as Sparse Dictionary Learning. This relaxation is the basis of our Relevant Subsequence Sparse Dictionary Learning (RS-DL) model. Applied to continuous sequences, RS-DL is effective at extracting human-recognizable motifs. In addition, subsequences extracted using RS-DL improve classification performance over standard nearest-neighbor and dynamic time warping techniques.
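To make the connection to Sparse Dictionary Learning concrete, the sketch below performs generic convolutional sparse coding of a one-dimensional sequence using ISTA (iterative soft-thresholding): the sequence is approximated by a sparse set of shifted short "motif" atoms. This is a standard textbook formulation, not the RS-DL algorithm itself; the function names, the fixed step size, and the omission of the dictionary-update step are all simplifying assumptions.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: the proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam=0.1, step=0.01, iters=200):
    """Encode a continuous sequence x (length T) as a sparse sum of
    shifted dictionary atoms D (K atoms, each of length m) by running
    ISTA on  min_z ||x - sum_k conv(z_k, D_k)||^2 + lam * ||z||_1.
    Returns z with shape (K, T - m + 1); z[k, t] is the weight of
    atom k placed at position t."""
    K, m = D.shape
    z = np.zeros((K, len(x) - m + 1))
    for _ in range(iters):
        recon = sum(np.convolve(z[k], D[k]) for k in range(K))
        resid = x - recon
        # Gradient step on the squared error, then shrink toward zero.
        grad = np.stack([np.correlate(resid, D[k], mode="valid")
                         for k in range(K)])
        z = soft(z + step * grad, step * lam)
    return z

# Hypothetical usage: a noisy sequence containing one planted motif.
rng = np.random.default_rng(0)
motif = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
x = 0.05 * rng.normal(size=60)
x[20:25] += motif                      # plant the motif at position 20
z = sparse_code(x, motif[None, :])
print(int(np.abs(z[0]).argmax()))      # recovers a position near 20
```

The sparse activation map z plays the role of the relaxed, continuous hidden states: a large entry z[k, t] indicates that motif k appears at position t, which is what makes the extracted motifs human-readable.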
The final contributions of this work incorporate Hidden Markov Model structure into a family of purely discriminative models. We call these models Subsequence Networks; they operate by building Profile HMM and Pair HMM structure into the lower layers of a neural network. This structure is similar to that of convolutional neural networks, which have achieved state-of-the-art results on a number of computer vision tasks. Subsequence Networks are competitive with state-of-the-art sequence kernel methods for protein sequence classification while operating in a significantly different manner.
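The flavor of this architecture can be seen in a small sketch. The thesis's Subsequence Networks embed Profile HMM and Pair HMM structure; the version below substitutes the simpler convolutional analogue the paragraph alludes to, a bank of position-specific weight matrices slid over a one-hot-encoded protein sequence and max-pooled over positions, so every name and shape here is an illustrative assumption rather than the actual model.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq):
    """Encode a protein sequence as an (L, 20) indicator matrix."""
    X = np.zeros((len(seq), len(ALPHABET)))
    for i, ch in enumerate(seq):
        X[i, ALPHABET.index(ch)] = 1.0
    return X

def subsequence_layer(seq, W):
    """Score every length-m window of seq against each of K filters
    W (shape (K, m, 20)), then max-pool over window positions. The
    filters stand in for Profile-HMM-style motif models; max-pooling
    returns a fixed-length (K,) vector for any sequence length."""
    X = one_hot(seq)
    K, m, _ = W.shape
    scores = np.stack([
        [np.sum(W[k] * X[t:t + m]) for t in range(len(seq) - m + 1)]
        for k in range(K)
    ])
    return scores.max(axis=1)  # best-matching window position per filter

# Hypothetical usage: random filters, variable-length inputs, fixed output.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5, len(ALPHABET)))   # 8 motif filters of width 5
for s in ("MKTAYIAKQR", "GAVLIMCFYW" * 3):
    print(len(s), subsequence_layer(s, W).shape)  # (8,) either way
```

Max-pooling over positions is what turns a variable-length sequence into a fixed-length vector, mirroring the role the latent-state statistics played in the generative models above, except that here the filters are trained discriminatively.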