This dissertation is a comprehensive study of the statistical properties of short nucleic acid subsequences found in bacterial genomes. The work reveals a correlation between the frequency of a short DNA subsequence in bacterial genomes and the total number of its one-mismatch (substitution-only) combinations. Moreover, the correlations are independent of G+C content across the full range observed in the bacterial genomic sequences studied. This has profound implications for the evolutionary dynamics of bacteria, implying similar rates of replication and mutation within these genomes. The pattern of consistent correlations is not reproduced by sequences simulated from a zero-order Markov model.

A group of mostly intracellular organisms showed multiple distinct clusters in their scatter plots of subsequence frequency versus one-mismatch neighbors; this was not observed in the genomes of other bacteria of similar length and G+C content. The clustering was found to result from lower total variance in the high-order transition matrix describing the sequence. This observation implies that the genomes of intracellular bacteria are more constrained than those of free-living bacteria.

Two frameworks for generating simulated genomes are presented, both based on third-order Markov models. Markov models parameterized by a particular genome recreate, at the third order, the general shape of the subsequence frequency versus one-mismatch neighbor plots. A method was developed for generating de novo Markov models capable of synthesizing a sequence with statistical properties similar to those of a bacterial genome. The model is parameterized only by the desired sequence length, G+C content, and variance.
It was observed that a relatively small amount of variance added to the third-order transition matrix describing a zero-order Markov process can generate a sequence with statistical properties similar to those of a bacterial genomic sequence of the same length and G+C content. Differences between the simulated and genomic sequences, such as outliers that appear in the subsequence frequency versus one-mismatch neighbor plots for bacterial genomes but are not reproduced by simulated sequences, highlight several subsequences of well-known biological interest. This novel sequence model can serve as a better null hypothesis, and deviations from it can be considered possible features of biological significance.
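
The central quantities of the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes that the "total number of one-mismatch combinations" of a subsequence means the summed count of all subsequences differing from it by exactly one substitution, and that the zero-order Markov model draws each base independently with G and C sharing the G+C fraction equally. All function names (`kmer_counts`, `one_mismatch_neighbors`, `neighbor_totals`, `simulate_zero_order`) are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def kmer_counts(seq, k):
    """Count every overlapping subsequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def one_mismatch_neighbors(kmer):
    """Yield the 3*k subsequences that differ from kmer by one substitution."""
    for i, base in enumerate(kmer):
        for alt in BASES:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def neighbor_totals(counts):
    """Map each observed k-mer to (own count, summed count of its neighbors).

    Assumption: this pair is what one point of the frequency versus
    one-mismatch neighbor scatter plot represents.
    """
    return {
        kmer: (n, sum(counts.get(nb, 0) for nb in one_mismatch_neighbors(kmer)))
        for kmer, n in counts.items()
    }

def simulate_zero_order(length, gc_content, seed=0):
    """Draw independent bases; G and C split gc_content, A and T the rest.

    Assumption: a simple zero-order model used only as a baseline here.
    """
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights follow the order of BASES: A, C, G, T.
    return "".join(rng.choices(BASES, weights=[at, gc, gc, at], k=length))
```

For example, `neighbor_totals(kmer_counts(simulate_zero_order(100000, 0.5), 6))` gives one (frequency, neighbor total) pair per observed 6-mer of a simulated sequence, which could then be plotted against the same pairs computed from a real genome.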