Finding genetic variants that regulate gene expression now plays a central role in the analysis of mechanisms in biological systems. This will increasingly be the case as large amounts of gene expression and genetic marker data are generated by next-generation sequencing technologies. While the unprecedented scale of these data gives scientists the opportunity to answer basic questions about biological systems, the properties of these data raise analysis challenges, particularly in terms of covariate modeling. For example, expression levels of thousands of genes are usually measured in batches, and different batches may be measured under different conditions, which creates the well-known batch effect. Besides this artificially created factor, which can affect the quality of the measurement, expression data often reflect environmental regulators that change gene expression levels, such as smoking and drug use. These sources of confounding need to be addressed either before or during the analysis of the data.

In this thesis, I address the analysis issues raised by a particular type of confounding in high-dimensional data: hidden factor effects. Hidden factors are defined as factors that contribute to variation in a large number of measured variables where there is no direct information concerning the factors in the data. It is critical to correct for hidden factors because, if ignored, they can lead to either high false positive rates or reduced power. To tackle this issue, I propose a statistical model that combines multivariate ridge regression and factor analysis to infer both the fixed effects and the hidden confounding. The method is unique in that it employs the multivariate regression components to infer the associations between the response Y and the covariate X, while it maintains efficiency by sharing the data reduction property of the factor analysis model.
Compared to other models that address the same issue, this model can successfully partition the covariance structure of the hidden factors, which dramatically improves the power and the accuracy of detecting the real associations between X and Y. I also use the model to address hidden factor issues in the analysis of gene expression levels measured in the airway of the lung in a sample of people, in the context of a genome association study referred to as an expression Quantitative Trait Loci (eQTL) analysis. I show that the method successfully eliminates the false positives caused by spurious structures (hidden factors) and greatly improves the power to detect true genetic determinants (the eQTL) that regulate gene expression in the lung airway. I also apply the method to a challenging Genotype-Environment Interaction (GEI) analysis, where GEI effects are defined as the dependence of genotype-phenotype relationships on environmental factors. I show that, despite the small sample size and the highly complicated data structure, my method identifies a large number of interesting GEI associations, many of which involve genes that other studies have independently verified to be highly relevant to lung disease and lung function. These GEI associations contain more information than a typical eQTL because they help to identify genetic regulators that behave differently under different environmental pressures, and these regulators serve as an interesting set of candidate genes for clinical scientists.