Noncoding RNAs (ncRNAs) are functional transcripts that do not code for proteins. Many of them play indispensable roles in the cell. For example, ribosomal RNAs are core components of the ribosome, the cell's factory for making proteins, and riboswitches bind small metabolites and regulate gene expression. Computational discovery of ncRNAs is challenging, however, because ncRNAs evolve rapidly at the nucleotide level while preserving secondary structure. In the first part of this thesis, we develop two clustering algorithms that are robust to weak sequence homology signals and applicable at the genomic scale. We show that both algorithms recover most known ncRNA families and that as few as five homologous sequences suffice to predict a strong motif.

In the second part of the thesis, we investigate whether secondary structure information improves maximum likelihood tree inference for ncRNAs. An accurate phylogenetic tree has important biological and clinical applications: it can be used to infer the function of novel organisms and to understand the evolutionary history of species. We show that combining structure information, a more realistic gap model, and a maximum likelihood approach improves phylogenetic tree inference.

In the third part of the thesis, we develop a method for profiling human gut microbial communities using high-throughput sequencing. Our method works on Illumina short reads and requires neither assembly nor taxonomic identification. We show that it can differentiate between the gut microbiota of healthy individuals at low sequencing depth, making it a cost-effective screening tool for large population studies.

In the final part of the thesis, we use a standard additions experiment to examine sequencing bias and errors in Illumina HiSeq data. We identify features associated with systematic errors and develop an error correction pipeline. We show that our method reduces base errors and produces better species diversity estimates.